The Amazon i3 Family

Amazon has recently released to general availability the i3.metal instance, which allows us to do some things we could not do before in the Amazon cloud, such as running an unmodified hypervisor. We were able to run more than six thousand KVM virtual machines on one of these instances, far beyond our pessimistic guess of around two thousand. In the remainder of this post we will discuss what makes these platforms important and unique, how we ran KVM virtual machines on the platform using Amazon’s own Linux distribution, and how we measured its performance and capacity using kprobes and the extended Berkeley Packet Filter (eBPF).

Read on for details!

i3.metal and the Nitro System

The i3 family platforms include two improvements over what Amazon has historically offered to AWS customers. The first is the combination of the Annapurna ASIC and the Nitro PCI card, which together integrate security, storage, and network I/O within custom silicon. The second improvement is the Nitro hypervisor, which replaces Xen for all new EC2 instance types. Together, we refer to the Nitro card, Annapurna ASIC, and Nitro hypervisor as the Nitro System. (See the EC2 FAQs entry for the Nitro Hypervisor for some additional details.)

Although Amazon has not released much information about the Nitro System, there are important technical insights in Brendan Gregg’s blog and in two videos (here and here) from the November 2017 AWS re:Invent conference. From these presentations, it is clear that the Nitro firmware includes a stripped-down version of the KVM hypervisor that forgoes the QEMU emulator and passes hardware directly to the running instance. In this sense, Nitro is more properly viewed as partitioning firmware that uses hardware self-virtualization features, including support for nested virtualization on the i3.metal instances.

Nitro protects the Annapurna ASIC and the multi-root PCI hardware from being reprogrammed on the i3.metal systems, but nothing else (this invisible presence protects against the use of unauthorized Elastic Block Store volumes or network access). For example, while Nitro has no hardware emulation (which is the role of QEMU in a conventional KVM hypervisor), Nitro does enable self-virtualizing hardware (pdf). Importantly, Nitro on the i3.metal system exposes hardware virtualization features to the running kernel, which can itself be a hypervisor. Thus, a hypervisor such as KVM, Xen, or VMware can be run directly in an i3.metal instance partitioned by the Nitro firmware.

Image above: Amazon’s i3 platform includes the Annapurna ASIC, the Nitro PCI Card, and the Nitro Firmware. See https://youtu.be/LabltEXk0VQ

Key Virtualization Features Exploited by the Nitro Firmware

Below is a brief, incomplete summary of virtualization features exploited by the Nitro system—particularly in the bare metal instances.

VMCS Shadowing

Virtual Machine Control Structure (VMCS) Shadowing provides hardware-nested virtualization on Intel processors. The VMCS is a set of registers that controls access to hardware features by a virtual machine (pdf). The first-level hypervisor (in this case the Nitro system) keeps a copy of the second- through nth-level VMCS and only examines registers that differ from the cached version. Not every register in the VMCS requires monitoring by the first-level hypervisor. The Nitro firmware thus provides nested virtualization with no material effect on performance, consuming only a small amount of additional processor resources. If the instance hypervisor does not violate the boundaries established by Nitro, there is no intervention and no effect upon performance.

Most significantly, the VMCS shadowing registers are freely available to the kernel running on the bare-metal instance, which is unique among EC2 instances.

Extended Page Tables

Extended Page Tables (EPT) are a hardware feature that allows a virtual machine to manage its own page tables once the hypervisor has established memory boundaries for it. Enabling this hardware feature produced a two-order-of-magnitude improvement in virtual machine performance on x86 hardware.

Like VMCS shadowing, EPT works especially well with nested hypervisors. The Nitro firmware establishes a page table for the bare-metal workload (Linux, KVM, or another hypervisor). The bare-metal workload then manages its own page tables.

As long as the workload does not violate the boundaries established by the Nitro firmware, Nitro does not affect its performance or functionality. Nitro’s role on i3.metal is to prevent the workload from re-configuring the Annapurna ASIC or the Nitro card and from violating the limits set for the instance.

Posted Interrupts

The multi-root virtualization capability (pptx) in the i3 instances virtualizes the Amazon Enhanced Networking and Elastic Block Storage (EBS) using PCI hardware devices (Annapurna ASIC and the Nitro card) assigned by the Nitro firmware to specific bare-metal workloads.

Posted interrupts (pdf) allow system firmware to deliver hardware interrupts directly to a virtual machine, when that virtual machine is assigned a PCI function. The Nitro system uses posted interrupts to allow the bare-metal workload to process hardware interrupts generated by the Nitro hardware without any intervention from the Nitro System.

That is, the Annapurna ASIC and Nitro PCI card can interrupt the bare-metal workload directly, while remaining protected from re-configuration by the bare-metal workload. There are no detrimental effects on performance as long as the Nitro System does not over-provision CPUs, which it does not do. (The bare-metal workload may over-provision CPUs, even if it is a hypervisor, as we will see below in the limited testing we did.)

Loading KVM on a Bare Metal Instance

On an EC2 Bare Metal system (i3.metal in the screen grab above), Nitro is hardware partitioning firmware. The Nitro firmware is based on KVM and does not use hardware emulation software (such as QEMU). It initializes the custom Amazon hardware and passes hardware through to the running instance: networking, storage, processors, PCI trees, and memory. It then jumps into the bare-metal instance kernel, which in our testing was Amazon Linux. (Amazon also supports the VMware hypervisor as a bare-metal instance.)

The Nitro firmware only activates if the bare-metal kernel violates established partitioning. The fact that the Nitro firmware is actually Linux and KVM is not new: Linux has been used as a BIOS for many years in complex systems that consolidate networked or shared resources for hardware platforms.

Passing-through the VMX flag and Running Nested Virtualization

The Bare Metal kernel sees the vmx flag when it inspects /proc/cpuinfo:
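We can confirm the flag with a quick grep (a sketch; the full flags line in /proc/cpuinfo is much longer than this single entry):

grep -o -m 1 vmx /proc/cpuinfo
# prints "vmx" when Intel VT-x is exposed to the kernel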

This flag is necessary in order to load KVM. It indicates that the Virtual Machine Control Structure (VMCS) is programmable by the Linux-KVM kernel. VMCS Shadowing makes this possible; it uses copy-on-write methods and register caching in the processor itself to run each layer in the stack (Nitro, KVM, and the Virtual Machine) directly on the processor hardware. Each layer is controlled by the layer beneath it.

The i3.metal systems use register caching and snooping to provide hardware-virtualized processors to each layer in the system, beginning with the Nitro System, up to virtual machines being run by the bare-metal instance (KVM in this case).

The Nitro firmware does not use QEMU because it does not emulate any hardware. In our testing, we did use QEMU hardware emulation in the upper layer virtual machines. This resulted in the picture below, where the Nitro firmware is running beneath the i3 instance kernel. We then loaded KVM, and used QEMU to provide hardware emulation to the virtual machines:

When running a hypervisor such as KVM on the i3.metal systems, each layer has direct access to the processor through VMCS Shadowing, which provides each layer with the Virtual Machine Control registers.

Installing KVM on an Amazon Linux Image

The Amazon Linux distribution is derived from Fedora Linux with KVM available as two loadable modules. (KVM is maintained and supported by Amazon as a standard feature of the bare metal instance.)

Some components need to be installed, for example QEMU:
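On Amazon Linux this is a straightforward package installation (a sketch; package names may vary between Amazon Linux releases):

sudo yum install -y qemu-kvm qemu-img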

Libvirt is not part of the Amazon Linux distribution, which keeps the installation lean. We do not need Libvirt, and it would get in the way of later testing.

Libvirt is an adequate collection of software, but qemu-kvm is not aware of it, meaning the virtual machine state information stored by Libvirt may be out of sync with qemu-kvm. Libvirt also adds an attack vector to KVM while providing little functionality beyond what standard Linux utilities and kernel features already provide alongside qemu-kvm.

Built-in Processor Support for KVM

The i3.metal instance has 72 threads running on 36 physical cores that support KVM and posted interrupts. This information may be read in /proc/cpuinfo: 
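A quick way to confirm the counts (a sketch):

grep -c ^processor /proc/cpuinfo          # 72 logical processors (threads)
lscpu | grep -E "^(Socket|Core|Thread)"   # expect 2 sockets, 18 cores per socket, 2 threads per core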

Loading KVM on the Nitro system is most easily done by modprobe’ing the KVM modules:
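For example (a sketch; on this Intel platform, kvm_intel pulls in the kvm and irqbypass modules as dependencies):

sudo modprobe kvm_intel
lsmod | egrep "kvm|irqbypass"
# kvm_intel, kvm, and irqbypass should all appear in the module list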

The irqbypass module provides posted interrupts to KVM virtual machines, reminding us again that we may pass PCI devices present on the bare-metal host through to KVM virtual machines.

Built-in virtio virtual I/O at the Linux Kernel Level

virtio is a Linux kernel I/O virtualization feature: it is maintained and supported by Amazon, and it works with qemu-kvm to provide isolated (not shared, as in Xen’s dom0 netback and blockback) virtual I/O devices for virtual machines that do not need direct access to a hardware PCI device. Each virtio device is a unique and private virtual PCI device, with separation provided by the Linux kernel.

The Amazon Linux kernel supports virtio devices, as shown by this excerpt of the Amazon Linux configuration file:
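A representative excerpt (a sketch; the exact set of options, and whether they are built in (=y) or modular (=m), varies by kernel build):

grep -E "CONFIG_VIRTIO(_PCI|_NET|_BLK)?=" /boot/config-$(uname -r)
# CONFIG_VIRTIO=y
# CONFIG_VIRTIO_PCI=y
# CONFIG_VIRTIO_NET=y
# CONFIG_VIRTIO_BLK=y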

Kernel Shared Memory (KSM)

KSM is a Linux kernel feature that scans memory pages, merges duplicates, marks those pages as read-only, and copies the pages when they are written (COW).  KSM provides a kernel-level mechanism for over-provisioning memory. KSM is automatic, built in, and does not require an external module as Xen does, for example, with its Dom0 balloon driver.

KSM is documented in the Linux kernel documentation directory.

The Amazon Linux kernel is configured with KSM:
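For example (a sketch):

grep CONFIG_KSM /boot/config-$(uname -r)
# CONFIG_KSM=y
cat /sys/kernel/mm/ksm/run
# 1 means the ksmd scanning thread is running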

Running a KVM virtual machine with copy-on-write memory is straightforward; start the virtual machine with the mem-merge feature turned on:
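A sketch of such an invocation (the qemu-kvm binary name, image path, and sizes are illustrative):

qemu-kvm -enable-kvm -machine mem-merge=on \
    -m 1024 -smp 1 \
    -drive file=ttylinux-vm01.qcow2,if=virtio \
    -display none -daemonize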

Using the -machine mem-merge=on option at virtual machine startup causes QEMU to issue an madvise system call with the MADV_MERGEABLE parameter for the virtual machine memory, marking the VM memory as mergeable.

To disable merging for a virtual machine upon startup, use the same command but substitute mem-merge=off.

Running the KVM Virtual Machine

We created a virtual machine using a minimal Linux distribution: TTY Linux. It has an image built specifically to run with KVM using virtio  network and block devices.

We ran KVM Linux virtual machines using this command line:
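A representative sketch of that command line (the image name, memory size, and networking options are placeholders, not the exact invocation we used):

qemu-kvm -enable-kvm -m 1024 -smp 1 \
    -drive file=ttylinux-vm01.qcow2,if=virtio \
    -netdev user,id=net0 -device virtio-net-pci,netdev=net0 \
    -display none -daemonize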

Only three steps are required to create the virtual machine (sketched after this list):

  1. Download the TTY Linux distribution and unzip to an iso image.
  2. Create the qcow disk image for the virtual machine.
  3. Run the virtual machine.
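A sketch of the three steps (file names and the 1G image size are illustrative; the TTY Linux archive is obtained from the TTY Linux project page):

# 1. Download the TTY Linux virtio distribution and unzip it to an iso image
unzip ttylinux-virtio_x86_64.zip        # produces a bootable ttylinux.iso

# 2. Create the qcow2 disk image for the virtual machine
qemu-img create -f qcow2 ttylinux-vm01.qcow2 1G

# 3. Run the virtual machine with the command line shown above, booting the
#    iso on first run by adding:  -cdrom ttylinux.iso -boot d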

We were struck by how easy it was to run KVM virtual machines on these Nitro systems, configured as they are with Amazon Linux. Each virtual machine in our testing had 1G of memory and 1G of writeable storage.

numactl and other Linux Process Control

A benefit of KVM on i3.metal is the ability to use standard Linux system calls to control virtual machine resources. A good example is using the Linux numactl command to allocate CPU cores for a KVM virtual machine:
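A sketch (the qemu-kvm arguments mirror the command line above):

numactl --physcpubind=1 --membind=0 \
    qemu-kvm -enable-kvm -m 1024 -smp 1 \
    -drive file=ttylinux-vm01.qcow2,if=virtio \
    -display none -daemonize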

The above command uses the numactl utility to bind the KVM virtual machine to Core #1. It demonstrates how integrated KVM is with the Linux kernel and how simple it is to allocate memory and cores to specific virtual machines.

Integration with the Linux Kernel: cgroups, nice, numactl, taskset

We can turn the Linux kernel into a hypervisor by loading the KVM modules and starting a virtual machine, but the Linux personality is still there. We can control the virtual machine using standard Linux resource and process control tools such as cgroups, nice, numactl, and taskset :
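For example (a sketch; the PID 19334 and the cgroup name are illustrative, and the cgroup v1 layout used by Amazon Linux at the time is assumed):

# Lower the scheduling priority of a running VM process
renice +10 -p 19334

# Pin the same VM to CPUs 2 and 3 after it has started
taskset -cp 2,3 19334

# Cap the VM at half of one CPU using the cpu cgroup controller
mkdir /sys/fs/cgroup/cpu/vm01
echo 50000  > /sys/fs/cgroup/cpu/vm01/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/vm01/cpu.cfs_period_us
echo 19334  > /sys/fs/cgroup/cpu/vm01/tasks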

All cgroup commands work naturally with KVM virtual machines. As far as cgroups is concerned, each KVM virtual machine is a normal Linux process (although KVM runs that process at the highest privilege level in VMX guest mode (pptx), which provides hardware virtualization support directly to the virtual machine). There are two utilities to bind a KVM virtual machine to a specific processor, NUMA node, or memory zone: taskset and numactl.

In summary, the Linux command set along with qemu-kvm gives us native control over processors, memory zones, and other platform properties for running KVM virtual machines. Libvirt, on the other hand, is a layer over these native control interfaces that tends to obscure what is really going on at the hardware level.

Testing the Limits of Bare-Metal AWS Hypervisor Performance

To more securely run virtual-machine workloads on cloud services, we accessed a bare-metal instance for project research during the preview period. We wanted to first verify that KVM can be used as a hypervisor on EC2 bare-metal instances, and second, get a read on stability and performance. We had limited time for this portion of the research.

To measure system response, we decided to use the BPF Compiler Collection (BCC) (building and using this toolset may be the subject of another blog post).

BCC uses the extended Berkeley Packet Filter (eBPF), an amazing piece of technology in recent Linux kernels that runs byte code supplied from user space inside the kernel. BCC compiles byte code that uses dynamic kernel probes (kprobes) to instrument kernel behavior.

To test CPU load, we added a simple shell script to each VM’s init process:
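The script was essentially a busy loop (a minimal sketch):

#!/bin/sh
# Burn CPU forever so the VM consumes every cycle KVM allows it
while true; do
    :
done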

This ensured that each virtual machine would be consuming all the CPU cycles allowed to it by KVM.

Next, we used a simple shell script to start KVM virtual machines into oblivion:
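A sketch of that launcher (image names, paths, and sizes are illustrative):

#!/bin/bash
# Keep cloning and starting VMs until qemu-kvm can no longer start one
i=0
while true; do
    qemu-img create -f qcow2 -b ttylinux-base.qcow2 /vms/vm-$i.qcow2 1G || break
    qemu-kvm -enable-kvm -m 1024 -smp 1 \
        -drive file=/vms/vm-$i.qcow2,if=virtio \
        -display none -daemonize || break
    i=$((i+1))
done
echo "started $i virtual machines"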

Then we ran the BCC program runqlat.py, which measures how much time processes are spending on the scheduler’s run queue – a measure of system load and stability. The histogram below shows the system when running 6417 virtual machines.
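The invocation is simple (a sketch, assuming the BCC tools are installed under /usr/share/bcc/tools):

# Sample run-queue latency for 10 seconds and print a single histogram
/usr/share/bcc/tools/runqlat 10 1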

The histogram above demonstrates how long, within a range, each sampled process waited on the KVM scheduler’s run queue before it was actually placed on the processor and run. The wait time in usecs shows how long a process that is runnable (not sleeping or waiting for any resources or events to occur) waited in order to run. There are three things to look for in this histogram:

  1. How closely grouped are the sampled wait times? Most processes should be waiting approximately the same amount of time. This histogram shows this is the case, with close to half of the samples waiting between 4 and 15 microseconds.
  2. How low are the wait times? On a system that is under-utilized, the wait times should be mostly immaterial (just a few microseconds or less on this hardware). This system is over-utilized, and yet the wait times for most of the samples are fewer than 15 microseconds.
  3. How scattered are the samples in terms of wait times? In this histogram there are two groups: the larger group with wait times less than 511 microseconds, and the smaller group with wait times between 1024 and 32767 microseconds. The second group consists of only roughly 7% of samples. We would expect a distressed system to show several different groups clustered around longer wait times, with outliers comprising more than 7% of all samples.

Upon reaching 6417 virtual machines, the system was unable to start any new VMs due to memory exhaustion. However, we were able to ssh to running VMs; when we stopped a VM, KVM started a new one. This system appeared to be capable of running indefinitely with this extreme load placed upon the CPU resources.

CPU and Memory Over-provisioning

When fully loaded with virtual machines, CPUs were overloaded 10:1 virtual cycles to physical cycles. There were more than thirty thousand processes running on the system, and it was actively reclaiming memory using KSM (discussed above). Before running the tests, the consensus among our team was that perhaps we could run 2K virtual machines before the system fell apart. This guess (that’s all it was) proved to be overly pessimistic. (However, we did not test I/O capacity in any significant way.)

Beyond proving that we could run a hell of a lot of virtual machines on the i3.metal platform, and that CPU over-provisioning was wickedly efficient, we didn’t accomplish much else; for example, we can conclude nothing about the I/O performance of the system. But these are rich grounds for further performance and limit testing using the BCC toolkit, which we hope to discuss in a later blog post.


Firecracker is a new KVM-based hypervisor from AWS that provides secure, fast microVMs for serverless computing.

It delivers container-like startup speed for services such as Lambda and Fargate, where containers would otherwise be used but shared-tenancy security is a concern.

It can run not only on AWS but also on-premises or on IBM bare metal, and of course on a PC as well. (It would be nice to use it with Vagrant?)


Original article: https://aws.amazon.com/ko/blogs/opensource/firecracker-open-source-secure-fast-microvm-serverless/


New Challenges for Virtualization Technology

Today, customers use serverless computing to build applications without having to provision or manage infrastructure. Developers can run serverless containers with AWS Fargate or serverless functions with AWS Lambda. Customers love the low operational overhead of serverless, and we believe it will continue to play a pivotal role in computing going forward.

As customers adopt serverless more and more, we have found that existing virtualization technology is not optimized for workloads that are event-driven and often short-lived. We saw the need to build virtualization technology designed specifically for serverless computing: something that provides the hardware-virtualization-based security boundary of a virtual machine while preserving the size and agility of containers.


Firecracker Technology

Meet Firecracker, an open source virtual machine monitor (VMM) that uses the Linux Kernel-based Virtual Machine (KVM). Firecracker allows you to create micro Virtual Machines or microVMs. Firecracker is minimalist by design – it includes only what you need to run secure and lightweight VMs. At every step of the design process, we optimized Firecracker for security, speed, and efficiency. For example, we can only boot relatively recent Linux kernels, and only when they are compiled with a specific set of configuration options (there are 1000+ kernel compile config options). Also, there is no support for graphics or accelerators of any kind, no support for hardware passthrough, and no support for (most) legacy devices.

Firecracker boots a minimal kernel config without relying on an emulated bios and without a complete device model. The only devices are virtio net and virtio block, as well as a one-button keyboard (the reset pin helps when there’s no power management device). This minimal device model not only enables faster startup times (< 125 ms on an i3.metal with the default microVM size), but also reduces the attack surface, for increased security. Read more details about Firecracker’s promise to enable minimal-overhead execution of container and serverless workloads.

In the fall of 2017, we decided to write Firecracker in Rust, a modern programming language that guarantees thread and memory safety and prevents buffer overflows and many other types of memory safety errors that can lead to security vulnerabilities. Read more details about the features and architecture of the Firecracker VMM at Firecracker Design.

Firecracker microVMs improve efficiency and utilization with a low memory overhead of < 5 MiB per microVM. This means that you can pack thousands of microVMs onto a single machine. You can use an in-process rate limiter to control, with fine granularity, how network and storage resources are shared, even across thousands of microVMs. All hardware compute resources can be safely oversubscribed, to maximize the number of workloads that can run on a host.

We developed Firecracker with the following guiding tenets (unless you know better ones) for the open source project:

  • Built-In Security: We provide compute security barriers that enable multitenant workloads, and cannot be mistakenly disabled by customers. Customer workloads are simultaneously considered sacred (shall not be touched) and malicious (shall be defended against).
  • Light-Weight Virtualization: We focus on transient or stateless workloads over long-running or persistent workloads. Firecracker’s hardware resources overhead is known and guaranteed.
  • Minimalist in Features: If it’s not clearly required for our mission, we won’t build it. We maintain a single implementation per capability.
  • Compute Oversubscription: All of the hardware compute resources exposed by Firecracker to guests can be securely oversubscribed.

We open sourced this foundational technology because we believe that our mission to build the next generation of virtualization for serverless computing has just begun.

Firecracker Usage

AWS Lambda uses Firecracker as the foundation for provisioning and running sandboxes upon which we execute customer code. Because Firecracker provides a secure microVM which can be rapidly provisioned with a minimal footprint, it enables performance without sacrificing security. This lets us drive high utilization on physical hardware, as we can now optimize how we distribute and run workloads for Lambda, mixing workloads based on factors like active/idle periods, and memory utilization.

Previously, Fargate Tasks consisted of one or more Docker containers running inside a dedicated EC2 VM to ensure isolation across Tasks. These Tasks now execute on Firecracker microVMs, which allows us to provision the Fargate runtime layer faster and more efficiently on EC2 bare metal instances, and improve density without compromising kernel-level isolation of Tasks. Over time, this will allow us to continue to innovate at the runtime layer, giving our customers even better performance while maintaining our high security bar, and lowering the overall cost of running serverless container architectures.

Firecracker runs on Intel processors today, with support for AMD and ARM coming in 2019.

You can run Firecracker on AWS .metal instances, as well as on any other bare-metal server, including on-premises environments and developer laptops.

Firecracker will also enable popular container runtimes such as containerd to manage containers as microVMs. This allows Docker and container orchestration frameworks such as Kubernetes to use Firecracker. We have built a prototype that enables containerd to manage containers as Firecracker microVMs and would like to work with the community to take it further.

Getting Started with Firecracker

Getting Started with Firecracker provides detailed instructions on how to download the Firecracker binary, start Firecracker with different options, build from the source, and run integration tests. You can run Firecracker in production using the Firecracker Jailer.

Let’s take a look at how to get started with using Firecracker on AWS Cloud (these steps can be used on any bare metal machine):

Create an i3.metal instance using Ubuntu 18.04.1.
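For example, with the AWS CLI (a sketch; the AMI ID, key pair, and security group are placeholders for your own values):

aws ec2 run-instances \
    --instance-type i3.metal \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx \
    --count 1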

Firecracker is built on top of KVM and needs read/write access to /dev/kvm. Log in to the host in one terminal and set up that access:

sudo setfacl -m u:${USER}:rw /dev/kvm

Download and start the Firecracker binary:

curl -fsSL -o firecracker-v0.11.0 https://github.com/firecracker-microvm/firecracker/releases/download/v0.11.0/firecracker-v0.11.0
chmod +x firecracker-v0.11.0
./firecracker-v0.11.0 --api-sock /tmp/firecracker.sock

Each microVM can be accessed using a REST API. In another terminal, query the microVM:

curl --unix-socket /tmp/firecracker.sock "http://localhost/machine-config"

This returns a response:

{ "vcpu_count": 1, "mem_size_mib": 128,  "ht_enabled": false,  "cpu_template": "Uninitialized" }

Running the Firecracker binary starts a VMM process that waits for the microVM configuration. By default, one vCPU and 128 MiB of memory are assigned to each microVM. Now this microVM needs to be configured with an uncompressed Linux kernel binary and an ext4 file system image to be used as the root filesystem.

Download a sample kernel and rootfs:

curl -fsSL -o hello-vmlinux.bin https://s3.amazonaws.com/spec.ccfc.min/img/hello/kernel/hello-vmlinux.bin
curl -fsSL -o hello-rootfs.ext4 https://s3.amazonaws.com/spec.ccfc.min/img/hello/fsfiles/hello-rootfs.ext4

Set up the guest kernel:

curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/boot-source'   \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{        "kernel_image_path": "./hello-vmlinux.bin", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"    }'

Set up the root filesystem:

curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/drives/rootfs' \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{        "drive_id": "rootfs",        "path_on_host": "./hello-rootfs.ext4",        "is_root_device": true,        "is_read_only": false    }'

Once the kernel and root filesystem are configured, the guest machine can be started:

curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/actions'       \
    -H  'Accept: application/json'          \
    -H  'Content-Type: application/json'    \
    -d '{        "action_type": "InstanceStart"     }'

The first terminal now shows a serial TTY prompting you to log in to the guest machine:

Welcome to Alpine Linux 3.8
Kernel 4.14.55-84.37.amzn2.x86_64 on an x86_64 (ttyS0)
localhost login:

Log in as root with password root to see the terminal of the guest machine:

localhost login: root
Password:
Welcome to Alpine! 

The Alpine Wiki contains a large amount of how-to guides and general information about administrating Alpine systems. 

See <http://wiki.alpinelinux.org>. 

You can setup the system with the command: setup-alpine 

You may change this message by editing /etc/motd.

login[979]: root login on 'ttyS0' 
localhost:~#

You can see the filesystem using ls /:

localhost:~# ls /
bin         home        media       root        srv         usr
dev         lib         mnt         run         sys         var
etc         lost+found  proc        sbin        tmp

Terminate the microVM using the reboot command. Firecracker currently does not implement guest power management, as a tradeoff for efficiency. Instead, the reboot command issues a keyboard reset action which is then used as a shutdown switch.

Once the basic microVM is created, you can add network interfaces, add more drives, and continue to configure the microVM.
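For example, a guest network interface backed by a host tap device can be configured with another PUT to the API socket (a sketch; in current Firecracker versions this must be done before InstanceStart, the tap device must already exist on the host, and the tap name and MAC address shown are placeholders):

curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/network-interfaces/eth0' \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{        "iface_id": "eth0",        "host_dev_name": "tap0",        "guest_mac": "AA:FC:00:00:00:01"    }'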

Want to create thousands of microVMs on your bare metal instance?

for ((i=0; i<1000; i++)); do
    ./firecracker-v0.11.0 --api-sock /tmp/firecracker-$i.sock &
done

Multiple microVMs may be configured with a single shared root file system, and each microVM can then be assigned its own read/write share.

Firecracker and Open Source

It is our mission to innovate on behalf of and for our customers, and we will continue to invest deeply in serverless computing at all three critical layers of the stack: the application, virtualization, and hardware layers. We want to offer our customers their choice of compute, whether instances or serverless, with no compromises on security, scalability, or performance. Firecracker is a fundamental building block for providing that experience.

Investing deeply in foundational technologies is one of the key ways that we at AWS approach innovation – not for tomorrow, but for the next decade and beyond. Sharing this technology with the community goes hand-in-hand with this innovation. Firecracker is licensed under Apache 2.0. Please visit the Firecracker GitHub repo to learn more and contribute to Firecracker.

By open sourcing Firecracker, we not only invite you to a deeper examination of the foundational technologies that we are building to underpin the future of serverless computing, but we also hope that you will join us in strengthening and improving Firecracker. See the Firecracker issues list and the Firecracker roadmap for more information.



A new AWS pricing calculator has been released.

The UI is cleaner, and you can now group multiple regions together and include them in a single estimate.

Start by using Edit Group to put the regions you want into a group.

https://calculator.aws/