Compute, Storage, and Networking Extensions
1 - Network Plugins
Network plugins in Kubernetes come in a few flavors:
- CNI plugins: adhere to the Container Network Interface (CNI) specification, designed for interoperability.
  - Kubernetes follows the v0.4.0 release of the CNI specification.
- Kubenet plugin: implements basic cbr0 using the bridge and host-local CNI plugins.
Installation
The kubelet has a single default network plugin, and a default network common to the entire cluster. It probes for plugins when it starts up, remembers what it finds, and executes the selected plugin at appropriate times in the pod lifecycle (this is only true for Docker, as CRI manages its own CNI plugins). There are two Kubelet command line parameters to keep in mind when using plugins:
- cni-bin-dir: Kubelet probes this directory for plugins on startup.
- network-plugin: The network plugin to use from cni-bin-dir. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is cni.
Network Plugin Requirements
Besides providing the NetworkPlugin interface to configure and clean up pod networking, the plugin may also need specific support for kube-proxy. The iptables proxy depends on iptables, and the plugin may need to ensure that container traffic is made available to iptables. For example, if the plugin connects containers to a Linux bridge, the plugin must set the net/bridge/bridge-nf-call-iptables sysctl to 1 to ensure that the iptables proxy functions correctly. If the plugin does not use a Linux bridge (but instead something like Open vSwitch or some other mechanism), it should ensure container traffic is appropriately routed for the proxy.
By default, if no kubelet network plugin is specified, the noop plugin is used, which sets net/bridge/bridge-nf-call-iptables=1 to ensure simple configurations (like Docker with a bridge) work correctly with the iptables proxy.
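To confirm the sysctl described above on a node, you can read it through procfs. The following is a minimal sketch, assuming a Linux host where the bridge netfilter module is loaded (the file does not exist otherwise); the helper name bridge_nf_call_iptables_enabled and its path parameter are hypothetical and only for illustration.

```python
from pathlib import Path

# procfs location of the net/bridge/bridge-nf-call-iptables sysctl on Linux.
SYSCTL_PATH = "/proc/sys/net/bridge/bridge-nf-call-iptables"


def bridge_nf_call_iptables_enabled(path: str = SYSCTL_PATH) -> bool:
    """Return True if the sysctl file exists and contains 1."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # File missing (module not loaded) or unreadable: treat as disabled.
        return False


if __name__ == "__main__":
    print("bridge-nf-call-iptables enabled:", bridge_nf_call_iptables_enabled())
```

The path parameter exists so that the check can be pointed at a test file instead of the real sysctl.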
CNI
The CNI plugin is selected by passing Kubelet the --network-plugin=cni command-line option. Kubelet reads a file from --cni-conf-dir (default /etc/cni/net.d) and uses the CNI configuration from that file to set up each pod's network. The CNI configuration file must match the CNI specification, and any required CNI plugins referenced by the configuration must be present in --cni-bin-dir (default /opt/cni/bin).
If there are multiple CNI configuration files in the directory, the kubelet uses the configuration file that comes first by name in lexicographic order.
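The selection rule can be sketched in a few lines. This is an illustration only, not the kubelet's code: the function name pick_cni_config is hypothetical, and the list of file extensions treated as configuration files is an assumption.

```python
import os
from typing import Optional

# Assumed set of file extensions treated as CNI configuration files.
CONF_EXTENSIONS = (".conf", ".conflist", ".json")


def pick_cni_config(conf_dir: str = "/etc/cni/net.d") -> Optional[str]:
    """Return the path of the config file that sorts first by name, or None."""
    candidates = sorted(
        name
        for name in os.listdir(conf_dir)
        if name.endswith(CONF_EXTENSIONS)
        and os.path.isfile(os.path.join(conf_dir, name))
    )
    return os.path.join(conf_dir, candidates[0]) if candidates else None
```

Because the comparison is lexicographic, a numeric prefix such as 05- or 10- in the file name is a simple way to control which configuration wins.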
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI lo plugin, at minimum version 0.2.0.
Support hostPort
The CNI networking plugin supports hostPort. You can use the official portmap plugin offered by the CNI plugin team or use your own plugin with portMapping functionality.
If you want to enable hostPort support, you must specify the portMappings capability in the CNI configuration in your cni-conf-dir.
For example:
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}
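If you generate CNI configuration programmatically, chaining a capability plugin amounts to appending one entry to the plugins list. The sketch below assumes the configuration is already loaded as a Python dict; the helper name with_capability_plugin is hypothetical and the base configuration is trimmed down from the example above.

```python
import copy
import json


def with_capability_plugin(conflist: dict, plugin_type: str, capability: str) -> dict:
    """Return a copy of conflist with a chained plugin that declares a capability."""
    result = copy.deepcopy(conflist)
    plugins = result.setdefault("plugins", [])
    # Skip the append if a plugin of this type is already chained.
    if not any(p.get("type") == plugin_type for p in plugins):
        plugins.append({"type": plugin_type, "capabilities": {capability: True}})
    return result


base = {
    "name": "k8s-pod-network",
    "cniVersion": "0.3.0",
    "plugins": [{"type": "calico"}],
}
print(json.dumps(with_capability_plugin(base, "portmap", "portMappings"), indent=2))
```

The input is deep-copied so the original configuration is left untouched, and calling the helper twice does not add a duplicate entry.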
Support traffic shaping
Experimental Feature
The CNI networking plugin also supports pod ingress and egress traffic shaping. You can use the official bandwidth plugin offered by the CNI plugin team or use your own plugin with bandwidth control functionality.
If you want to enable traffic shaping support, you must add the bandwidth plugin to your CNI configuration file (default /etc/cni/net.d) and ensure that the binary is included in your CNI bin dir (default /opt/cni/bin).
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
Now you can add the kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations to your pod.
For example:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...
kubenet
Kubenet is a very basic network plugin, available on Linux only. It does not, by itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
Kubenet creates a Linux bridge named cbr0 and creates a veth pair for each pod with the host end of each pair connected to cbr0. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager. cbr0 is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
The plugin requires a few things:
- The standard CNI bridge, lo and host-local plugins are required, at minimum version 0.2.0. Kubenet will first search for them in /opt/cni/bin. Specify cni-bin-dir to supply an additional search path. The first match found will take effect.
- Kubelet must be run with the --network-plugin=kubenet argument to enable the plugin.
- Kubelet should also be run with the --non-masquerade-cidr=<clusterCidr> argument to ensure traffic to IPs outside this range will use IP masquerade.
- The node must be assigned an IP subnet through either the --pod-cidr kubelet command-line option or the --allocate-node-cidrs=true --cluster-cidr=<cidr> controller-manager command-line options.
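The effect of --non-masquerade-cidr can be pictured as a simple membership test: destinations inside the range keep the pod's source IP, and everything else is masqueraded. The sketch below only illustrates that rule and is not how the kubelet implements it; the function name needs_masquerade and the 10.244.0.0/16 cluster CIDR are hypothetical.

```python
import ipaddress


def needs_masquerade(dest_ip: str, non_masquerade_cidr: str) -> bool:
    """Return True if dest_ip falls outside the non-masquerade range."""
    network = ipaddress.ip_network(non_masquerade_cidr)
    return ipaddress.ip_address(dest_ip) not in network


# Hypothetical cluster CIDR, used only for this example.
CLUSTER_CIDR = "10.244.0.0/16"

for dest in ("10.244.3.7", "192.0.2.10"):
    action = "masquerade" if needs_masquerade(dest, CLUSTER_CIDR) else "keep pod IP"
    print(dest, "->", action)
```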
Customizing the MTU (with kubenet)
The MTU should always be configured correctly to get the best networking performance. Network plugins will usually try to infer a sensible MTU, but sometimes the logic will not result in an optimal MTU. For example, if the Docker bridge or another interface has a small MTU, kubenet will currently select that MTU. Or if you are using IPSEC encapsulation, the MTU must be reduced, and this calculation is out-of-scope for most network plugins.
Where needed, you can specify the MTU explicitly with the network-plugin-mtu kubelet option. For example, on AWS the eth0 MTU is typically 9001, so you might specify --network-plugin-mtu=9001. If you're using IPSEC you might reduce it to allow for encapsulation overhead; for example: --network-plugin-mtu=8873.
This option is provided to the network-plugin; currently only kubenet supports network-plugin-mtu.
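The arithmetic behind the MTU values above can be written out as a short sketch. It assumes the default is the smallest MTU among the host's enabled interfaces, as described for cbr0, and that encapsulation overhead is a fixed number of bytes; the 128-byte figure is inferred from the example values (9001 - 8873) and is not a general IPSEC constant. The function name kubenet_style_mtu is hypothetical.

```python
def kubenet_style_mtu(interface_mtus: dict, encapsulation_overhead: int = 0) -> int:
    """Pick the smallest interface MTU, then subtract any encapsulation overhead."""
    if not interface_mtus:
        raise ValueError("at least one interface MTU is required")
    return min(interface_mtus.values()) - encapsulation_overhead


# Overhead of 128 bytes is an assumption derived from the example above.
print(kubenet_style_mtu({"eth0": 9001}, encapsulation_overhead=128))
# A small-MTU interface such as a Docker bridge drags the result down.
print(kubenet_style_mtu({"eth0": 9001, "docker0": 1500}))
```

The second call shows why an explicit --network-plugin-mtu can be needed: one small-MTU interface determines the inferred value.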
Usage Summary
- --network-plugin=cni specifies that we use the cni network plugin with actual CNI plugin binaries located in --cni-bin-dir (default /opt/cni/bin) and CNI plugin configuration located in --cni-conf-dir (default /etc/cni/net.d).
- --network-plugin=kubenet specifies that we use the kubenet network plugin with CNI bridge, lo and host-local plugins placed in /opt/cni/bin or cni-bin-dir.
- --network-plugin-mtu=9001 specifies the MTU to use, currently only used by the kubenet network plugin.
2 - Device Plugins
Kubernetes v1.10 [beta]
Kubernetes provides a device plugin framework that you can use to advertise system hardware resources to the Kubelet.
Instead of customizing the code for Kubernetes itself, vendors can implement a device plugin that you deploy either manually or as a DaemonSet. The targeted devices include GPUs, high-performance NICs, FPGAs, InfiniBand adapters, and other similar computing resources that may require vendor-specific initialization and setup.
Device plugin registration
The kubelet exports a Registration gRPC service:
service Registration {
    rpc Register(RegisterRequest) returns (Empty) {}
}
A device plugin can register itself with the kubelet through this gRPC service. During the registration, the device plugin needs to send:
- The name of its Unix socket.
- The Device Plugin API version against which it was built.
- The ResourceName it wants to advertise. Here ResourceName needs to follow the extended resource naming scheme as vendor-domain/resourcetype. (For example, an NVIDIA GPU is advertised as nvidia.com/gpu.)
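The three registration fields listed above can be modelled as a small record. This is a plain-Python stand-in for illustration, not the generated gRPC message: the class and helper names are hypothetical, the field names are assumed to mirror the RegisterRequest message, and the name check is deliberately loose (it only requires a non-empty vendor domain and resource type separated by a slash).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterRequest:
    version: str        # Device Plugin API version the plugin was built against
    endpoint: str       # name of the plugin's Unix socket
    resource_name: str  # vendor-domain/resourcetype


def build_register_request(socket_name: str, api_version: str,
                           resource_name: str) -> RegisterRequest:
    """Build the request after a basic vendor-domain/resourcetype check."""
    vendor_domain, sep, resource_type = resource_name.partition("/")
    if not (sep and vendor_domain and resource_type):
        raise ValueError(f"expected vendor-domain/resourcetype, got {resource_name!r}")
    return RegisterRequest(version=api_version, endpoint=socket_name,
                           resource_name=resource_name)


# Socket name and API version below are example values.
print(build_register_request("foo.sock", "v1beta1", "hardware-vendor.example/foo"))
```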
Following a successful registration, the device plugin sends the kubelet the list of devices it manages, and the kubelet is then in charge of advertising those resources to the API server as part of the kubelet node status update. For example, after a device plugin registers hardware-vendor.example/foo with the kubelet and reports two healthy devices on a node, the node status is updated to advertise that the node has 2 "Foo" devices installed and available.
Then, users can request devices in a Container specification as they request other types of resources, with the following limitations:
- Extended resources are only supported as integer resources and cannot be overcommitted.
- Devices cannot be shared among Containers.
Suppose a Kubernetes cluster is running a device plugin that advertises resource hardware-vendor.example/foo on certain nodes. Here is an example of a pod requesting this resource to run a demo workload:
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          hardware-vendor.example/foo: 2
#
# This Pod needs 2 of the hardware-vendor.example/foo devices
# and can only schedule onto a Node that's able to satisfy
# that need.
#
# If the Node has more than 2 of those devices available, the
# remainder would be available for other Pods to use.
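Since extended resources are whole numbers and cannot be overcommitted, the fit check for a node reduces to integer arithmetic. The following sketch illustrates that rule only; it is not the scheduler's implementation, and the function names are hypothetical.

```python
def devices_available(healthy: int, allocated: int) -> int:
    """Devices still free on the node; never negative."""
    return max(healthy - allocated, 0)


def fits(requested: int, healthy: int, allocated: int) -> bool:
    """A request fits only if enough whole devices are free (no overcommit)."""
    if requested < 0:
        raise ValueError("requested device count must not be negative")
    return requested <= devices_available(healthy, allocated)


# demo-pod asks for 2 devices on a node with 3 healthy devices, none in use.
print(fits(2, healthy=3, allocated=0))         # the pod can be placed
print(devices_available(3, allocated=2))       # what is left for other pods
```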
Device plugin implementation
The general workflow of a device plugin includes the following steps:
- Initialization. During this phase, the device plugin performs vendor-specific initialization and setup to make sure the devices are in a ready state.
- The plugin starts a gRPC service, with a Unix socket under host path /var/lib/kubelet/device-plugins/, that implements the following interfaces:

  service DevicePlugin {
        // GetDevicePluginOptions returns options to be communicated with Device Manager.
        rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

        // ListAndWatch returns a stream of List of Devices
        // Whenever a Device state change or a Device disappears, ListAndWatch
        // returns the new list
        rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

        // Allocate is called during container creation so that the Device
        // Plugin can run device specific operations and instruct Kubelet
        // of the steps to make the Device available in the container
        rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

        // GetPreferredAllocation returns a preferred set of devices to allocate
        // from a list of available ones. The resulting preferred allocation is not
        // guaranteed to be the allocation ultimately performed by the
        // devicemanager. It is only designed to help the devicemanager make a more
        // informed allocation decision when possible.
        rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}

        // PreStartContainer is called, if indicated by Device Plugin during registration phase,
        // before each container start. Device plugin can run device specific operations
        // such as resetting the device before making devices available to the container.
        rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
  }

  Note: Plugins are not required to provide useful implementations for GetPreferredAllocation() or PreStartContainer(). Flags indicating which (if any) of these calls are available should be set in the DevicePluginOptions message sent back by a call to GetDevicePluginOptions(). The kubelet will always call GetDevicePluginOptions() to see which optional functions are available, before calling any of them directly.
- The plugin registers itself with the kubelet through the Unix socket at host path /var/lib/kubelet/device-plugins/kubelet.sock.
- After successfully registering itself, the device plugin runs in serving mode, during which it keeps monitoring device health and reports back to the kubelet upon any device state changes. It is also responsible for serving Allocate gRPC requests. During Allocate, the device plugin may do device-specific preparation; for example, GPU cleanup or QRNG initialization. If the operations succeed, the device plugin returns an AllocateResponse that contains container runtime configurations for accessing the allocated devices. The kubelet passes this information to the container runtime.
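The way the optional calls are gated by DevicePluginOptions can be sketched without any gRPC machinery. The Python class below is a stand-in for the protobuf message, and its two field names (pre_start_required and get_preferred_allocation_available) are assumptions about that message rather than something stated on this page; the helper optional_calls is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DevicePluginOptions:
    # Assumed flag names; both default to "not available".
    pre_start_required: bool = False
    get_preferred_allocation_available: bool = False


def optional_calls(options: DevicePluginOptions) -> List[str]:
    """List the optional RPCs a caller may invoke, based on the reported flags."""
    calls = []
    if options.get_preferred_allocation_available:
        calls.append("GetPreferredAllocation")
    if options.pre_start_required:
        calls.append("PreStartContainer")
    return calls


# A plugin that only needs a hook before each container start.
print(optional_calls(DevicePluginOptions(pre_start_required=True)))
```

A plugin that leaves both flags unset is never asked to serve either optional call, which is why placeholder implementations are acceptable.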
Handling kubelet restarts
A device plugin is expected to detect kubelet restarts and re-register itself with the new kubelet instance. In the current implementation, a new kubelet instance deletes all the existing Unix sockets under /var/lib/kubelet/device-plugins when it starts. A device plugin can monitor the deletion of its Unix socket and re-register itself upon such an event.
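One simple way to notice the deletion is to poll for the socket file. The sketch below uses polling in place of a file-system notification mechanism purely to stay dependency-free; the function names, the poll interval, and the register callback are all hypothetical, and a real plugin would also need to recreate its socket and restart its gRPC server before registering again.

```python
import os
import time
from typing import Callable, Optional


def wait_for_socket_removal(socket_path: str, poll_interval: float = 1.0,
                            timeout: Optional[float] = None) -> bool:
    """Block until socket_path disappears; return False if the timeout expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while os.path.exists(socket_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def reregister_on_removal(socket_path: str, register: Callable[[], None],
                          poll_interval: float = 1.0,
                          timeout: Optional[float] = None) -> bool:
    """Call register() once the plugin's socket has been deleted."""
    if wait_for_socket_removal(socket_path, poll_interval, timeout):
        register()
        return True
    return False
```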
Device plugin deployment
You can deploy a device plugin as a DaemonSet, as a package for your node's operating system, or manually.
The canonical directory /var/lib/kubelet/device-plugins requires privileged access, so a device plugin must run in a privileged security context. If you're deploying a device plugin as a DaemonSet, /var/lib/kubelet/device-plugins must be mounted as a Volume in the plugin's PodSpec.
If you choose the DaemonSet approach you can rely on Kubernetes to: place the device plugin's Pod onto Nodes, to restart the daemon Pod after failure, and to help automate upgrades.
API compatibility
Kubernetes device plugin support is in beta. The API may change before stabilization, in incompatible ways. As a project, Kubernetes recommends that device plugin developers:
- Watch for changes in future releases.
- Support multiple versions of the device plugin API for backward/forward compatibility.
If you enable the DevicePlugins feature and run device plugins on nodes that need to be upgraded to a Kubernetes release with a newer device plugin API version, upgrade your device plugins to support both versions before upgrading these nodes. Taking that approach will ensure the continuous functioning of the device allocations during the upgrade.
Monitoring Device Plugin Resources
Kubernetes v1.15 [beta]
In order to monitor resources provided by device plugins, monitoring agents need to be able to discover the set of devices that are in use on the node and obtain metadata to describe which container the metric should be associated with. Prometheus metrics exposed by device monitoring agents should follow the Kubernetes Instrumentation Guidelines, identifying containers using pod, namespace, and container Prometheus labels.
The kubelet provides a gRPC service to enable discovery of in-use devices, and to provide metadata for these devices:
// PodResourcesLister is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResourcesLister {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
}
The gRPC service is served over a Unix socket at /var/lib/kubelet/pod-resources/kubelet.sock.
Monitoring agents for device plugin resources can be deployed as a daemon, or as a DaemonSet.
The canonical directory /var/lib/kubelet/pod-resources requires privileged access, so monitoring agents must run in a privileged security context. If a device monitoring agent is running as a DaemonSet, /var/lib/kubelet/pod-resources must be mounted as a Volume in the device monitoring agent's PodSpec.
Support for the "PodResources service" requires the KubeletPodResources feature gate to be enabled. It is enabled by default starting with Kubernetes 1.15 and is v1 since Kubernetes 1.20.
Device Plugin integration with the Topology Manager
Kubernetes v1.18 [beta]
The Topology Manager is a Kubelet component that allows resources to be coordinated in a topology-aligned manner. In order to do this, the Device Plugin API was extended to include a TopologyInfo struct.
message TopologyInfo {
    repeated NUMANode nodes = 1;
}

message NUMANode {
    int64 ID = 1;
}
Device Plugins that wish to leverage the Topology Manager can send back a populated TopologyInfo struct as part of the device registration, along with the device IDs and the health of the device. The device manager will then use this information to consult with the Topology Manager and make resource assignment decisions.
TopologyInfo supports a nodes field that is either nil (the default) or a list of NUMA nodes. This lets the Device Plugin advertise a device that can span NUMA nodes.
An example TopologyInfo struct populated for a device by a Device Plugin:
pluginapi.Device{ID: "25102017", Health: pluginapi.Healthy, Topology: &pluginapi.TopologyInfo{Nodes: []*pluginapi.NUMANode{&pluginapi.NUMANode{ID: 0}}}}
Device plugin examples
Here are some examples of device plugin implementations:
- The AMD GPU device plugin
- The Intel device plugins for Intel GPU, FPGA and QuickAssist devices
- The KubeVirt device plugins for hardware-assisted virtualization
- The NVIDIA GPU device plugin
  - Requires nvidia-docker 2.0, which allows you to run GPU-enabled Docker containers.
- The NVIDIA GPU device plugin for Container-Optimized OS
- The RDMA device plugin
- The Solarflare device plugin
- The SR-IOV Network device plugin
- The Xilinx FPGA device plugins for Xilinx FPGA devices
What's next
- Learn about scheduling GPU resources using device plugins
- Learn about advertising extended resources on a node
- Read about using hardware acceleration for TLS ingress with Kubernetes
- Learn about the Topology Manager