AWS has open-sourced a new virtualization technology called Firecracker.

Firecracker is an open source virtualization technology that is purpose-built for creating and managing secure, multi-tenant container and function-based services that provide serverless operational models. 

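It does address a real Docker pain point. You may have noticed that inside a Docker container, /proc/cpuinfo and /proc/meminfo show the host's values. On one hand that exposes information about the host; on the other it misleads some Java applications, because the JVM's default heap size is 1/4 of the memory reported in /proc/meminfo (Ref). Sized from the host's figure, the heap can easily outgrow what the container is allowed and cause memory errors.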
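To see the number the JVM would start from, here is a minimal sketch that reads MemTotal and prints a quarter of it. The function name is made up for this example, and the real JVM heuristics have more cases than a flat 1/4, so treat the output as a rough estimate only.

```shell
# Print one quarter of MemTotal, in MiB, from a meminfo-style file.
# quarter_of_memtotal_mib is a hypothetical helper name, not a standard tool.
quarter_of_memtotal_mib() {
    meminfo="${1:-/proc/meminfo}"
    # MemTotal is reported in kB: divide by 1024 for MiB, then by 4.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 4 }' "$meminfo"
}
```

Running `quarter_of_memtotal_mib` with no argument inside a plain Docker container and again on the host should print the same value, which is exactly the problem described above.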

The usual workaround for this is lxcfs.

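Firecracker sits between a VM and Docker, and calls what it runs a microVM. Unlike Docker, Firecracker is built on KVM, which brings better security and isolation and suits multi-tenant environments. AWS uses it heavily in Lambda.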

(Figure: Firecracker architecture)

Let's try it out.

Firecracker drives KVM, so the KVM packages must be installed on the host.

sudo apt-get install qemu-kvm

After installation the /dev/kvm device file is created. If you get a Not Found error for it, install the package above first.

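Firecracker also requires kernel 4.14 or later; on anything older you need to upgrade the kernel. Ubuntu and deepin keep their kernels current, so they are usually fine. On CentOS, consider a community-built kernel; for CentOS 7 these have already reached 4.18.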
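A small sketch for checking the 4.14 requirement from a script. The function name and argument order are my own choices for this example; it only compares the major and minor numbers of a `uname -r` style string.

```shell
# Return 0 if kernel version $1 (e.g. "4.18.0-80.el7") is at least $2.$3.
# kernel_at_least is a hypothetical helper written for this post.
kernel_at_least() {
    ver="$1"; want_major="$2"; want_minor="$3"
    major="${ver%%.*}"            # text before the first dot
    rest="${ver#*.}"              # text after the first dot
    minor="${rest%%[!0-9]*}"      # leading digits of what follows
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 4 14; then
    echo "kernel is new enough for Firecracker"
else
    echo "kernel is older than 4.14, upgrade needed"
fi
```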

Set the permissions on /dev/kvm so that the current user can read and write it.

sudo setfacl -m u:${USER}:rw /dev/kvm
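A quick way to confirm the ACL took effect is to test read and write access from the shell. This is only a sketch; the function name is made up here, and the optional path argument exists just to make it easy to try against other files.

```shell
# Return 0 if the device node is readable and writable by the current user.
# can_use_kvm is a hypothetical helper; it defaults to /dev/kvm.
can_use_kvm() {
    dev="${1:-/dev/kvm}"
    [ -r "$dev" ] && [ -w "$dev" ]
}

if can_use_kvm; then
    echo "/dev/kvm is accessible"
else
    echo "/dev/kvm is missing or not read/write for $(id -un)"
fi
```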

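Start the Firecracker server. It starts an HTTP server that listens on a Unix socket, which is reached through the file /tmp/firecracker.sock. The Firecracker binary can be downloaded from the releases page; the latest version at the time of writing is v0.11.0.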

sudo rm -f /tmp/firecracker.sock
sudo ./firecracker-v0.11.0 --api-sock /tmp/firecracker.sock
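If you drive all of this from one script instead of two terminals, the API calls can run before the socket file exists. A minimal polling sketch, assuming one-second granularity is good enough; the function name and the default of 5 tries are arbitrary choices for this example.

```shell
# Wait until a Unix socket exists at $1, polling once per second, up to $2 tries.
# wait_for_socket is a hypothetical helper, not part of Firecracker.
wait_for_socket() {
    sock="$1"; tries="${2:-5}"
    while [ "$tries" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    # Final check; a non-zero status means the socket never appeared.
    [ -S "$sock" ]
}
```

Usage would look like `wait_for_socket /tmp/firecracker.sock 10 || echo "firecracker did not start"`.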

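Firecracker runs microVMs and, like KVM, needs a kernel and a rootfs. The official guide downloads them from S3, but S3 seems to be blocked by the firewall here, so I put a copy in Tencent Cloud object storage and you can download it from there. A quick plug: Tencent Cloud gives you 50GB of storage for free, though sadly only 10GB of traffic per month.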

Download both files into the same directory.

wget https://silenceshell-1255345740.cos.ap-shanghai.myqcloud.com/hello-vmlinux.bin
wget https://silenceshell-1255345740.cos.ap-shanghai.myqcloud.com/hello-rootfs.ext4

Open another terminal and use curl to drive Firecracker.

#!/bin/bash

sudo curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/boot-source'   \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{
        "kernel_image_path": "./hello-vmlinux.bin",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
    }'
sudo curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/drives/rootfs' \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{
        "drive_id": "rootfs",
        "path_on_host": "./hello-rootfs.ext4",
        "is_root_device": true,
        "is_read_only": false
    }'
sudo curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT 'http://localhost/actions'       \
    -H  'Accept: application/json'          \
    -H  'Content-Type: application/json'    \
    -d '{
        "action_type": "InstanceStart"
     }'
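The three calls above differ only in the URL path and the JSON body, so they fold nicely into one function. This is just a sketch: `fc_put`, `FC_SOCK`, and `CURL` are names I made up, not anything Firecracker defines.

```shell
# Hypothetical wrapper: PUT a JSON body to a path on the Firecracker API socket.
# FC_SOCK and CURL are variables invented for this sketch.
FC_SOCK="${FC_SOCK:-/tmp/firecracker.sock}"
CURL="${CURL:-sudo curl}"

fc_put() {
    path="$1"; body="$2"
    # $CURL is deliberately unquoted so that "sudo curl" splits into two words.
    $CURL --unix-socket "$FC_SOCK" -i \
        -X PUT "http://localhost/${path}" \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d "$body"
}

# Usage (not run here):
# fc_put boot-source '{"kernel_image_path": "./hello-vmlinux.bin", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"}'
# fc_put drives/rootfs '{"drive_id": "rootfs", "path_on_host": "./hello-rootfs.ext4", "is_root_device": true, "is_read_only": false}'
# fc_put actions '{"action_type": "InstanceStart"}'
```

Setting `CURL=echo` before calling `fc_put` prints the command line instead of sending it, which is handy for a dry run.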

Back in the terminal where firecracker is listening, you can see that the microVM is up and waiting at a login prompt. The username and password are root/root.

sudo ./firecracker-v0.11.0 --api-sock /tmp/firecracker.sock
[    0.000000] Linux version 4.14.55-84.37.amzn2.x86_64 (mockbuild@ip-10-0-1-79) (gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)) #1 SMP Wed Jul 25 18:47:15 UTC 2018
[    0.000000] Command line: console=ttyS0 reboot=k panic=1 pci=off  root=/dev/vda virtio_mmio.device=4K@0xd0000000:5
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] e820: BIOS-provided physical RAM map:
...

Welcome to Alpine Linux 3.8
Kernel 4.14.55-84.37.amzn2.x86_64 on an x86_64 (ttyS0)

localhost login: root
Password:
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

login[980]: root login on 'ttyS0'
localhost:~# df -h
Filesystem                Size      Used Available Use% Mounted on
/dev/root                28.0M     21.1M      4.9M  81% /
devtmpfs                 10.0M         0     10.0M   0% /dev
tmpfs                    11.2M     96.0K     11.1M   1% /run
shm                      56.1M         0     56.1M   0% /dev/shm
localhost:~# ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 {openrc-init} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [kworker/0:0]

By default a microVM gets 1 vCPU and 128MB of memory. Checking /proc/cpuinfo and /proc/meminfo inside it shows that they now report the correct values.
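The defaults can be changed before the microVM is started. As far as I can tell from the Firecracker getting-started guide, this goes through a `/machine-config` endpoint with `vcpu_count` and `mem_size_mib` fields; other versions may require more fields, so check the API spec of the release you run. The two function names below are made up for this sketch.

```shell
# Build the JSON body for /machine-config from a vCPU count and a memory size in MiB.
# machine_config_json and set_machine_config are hypothetical helper names.
machine_config_json() {
    printf '{"vcpu_count": %d, "mem_size_mib": %d}\n' "$1" "$2"
}

# Send the config; this has to happen before the InstanceStart action.
set_machine_config() {
    sudo curl --unix-socket /tmp/firecracker.sock -i \
        -X PUT 'http://localhost/machine-config' \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d "$(machine_config_json "$1" "$2")"
}

# Usage (not run here): set_machine_config 2 256
```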

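The root filesystem is 28M, which is simply the size of the rootfs image: Firecracker uses the hello-rootfs.ext4 file as a block device. You can write something to the root partition, shut the microVM down (with the reboot command), and start it again; the file you wrote is still there.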
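Because writes land in the image file itself, it can be worth booting from a copy when you just want to experiment. A small sketch, with the `.scratch` suffix and the function name being my own conventions:

```shell
# Copy a rootfs image to a scratch file and print the path of the copy.
# make_scratch_rootfs is a hypothetical helper; the suffix is arbitrary.
make_scratch_rootfs() {
    src="$1"
    dst="${2:-${src}.scratch}"
    cp "$src" "$dst" && printf '%s\n' "$dst"
}
```

For example, run `make_scratch_rootfs ./hello-rootfs.ext4` and put the printed path in `path_on_host`; deleting the scratch file afterwards returns you to a clean image.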

Firecracker boots fast and is very lightweight. It is impressive, promising technology and well worth a close look.

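There is still a lot I have not figured out, though: networking, the filesystem, vCPUs, how it compares with Docker, and how it interacts with containerd and with KVM.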

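A full boot log is attached below. The boot sequence looks much like a regular KVM guest, except that OpenRC is used in place of the traditional Linux init process.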

sudo ./firecracker-v0.11.0 --api-sock /tmp/firecracker.sock
[    0.000000] Linux version 4.14.55-84.37.amzn2.x86_64 (mockbuild@ip-10-0-1-79) (gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)) #1 SMP Wed Jul 25 18:47:15 UTC 2018
[    0.000000] Command line: console=ttyS0 reboot=k panic=1 pci=off  root=/dev/vda virtio_mmio.device=4K@0xd0000000:5
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000007ffffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] Hypervisor detected: KVM
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] e820: last_pfn = 0x8000 max_arch_pfn = 0x400000000
[    0.000000] MTRR: Disabled
[    0.000000] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[    0.000000] CPU MTRRs all blank - virtualized system.
[    0.000000] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
[    0.000000] found SMP MP-table at [mem 0x0009fc00-0x0009fc0f] mapped at [ffffffffff200c00]
[    0.000000] Scanning 1 areas for low memory corruption
[    0.000000] Using GB pages for direct mapping
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] NODE_DATA(0) allocated [mem 0x07fde000-0x07ffffff]
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 0:7fdc001, primary cpu clock
[    0.000000] kvm-clock: using sched offset of 400616677122 cycles
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x0000000007ffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x0000000007ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x0000000007ffffff]
[    0.000000] Intel MultiProcessor Specification v1.4
[    0.000000] MPTABLE: OEM ID: FC
[    0.000000] MPTABLE: Product ID: 000000000000
[    0.000000] MPTABLE: APIC at: 0xFEE00000
[    0.000000] Processor #0 (Bootup-CPU)
[    0.000000] IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
[    0.000000] Processors: 1
[    0.000000] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
[    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x000fffff]
[    0.000000] e820: [mem 0x08000000-0xffffffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on KVM
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.000000] random: get_random_bytes called from start_kernel+0x94/0x486 with crng_init=0
[    0.000000] setup_percpu: NR_CPUS:128 nr_cpumask_bits:128 nr_cpu_ids:1 nr_node_ids:1
[    0.000000] percpu: Embedded 41 pages/cpu @ffff880007c00000 s128728 r8192 d31016 u2097152
[    0.000000] KVM setup async PF for cpu 0
[    0.000000] kvm-stealtime: cpu 0, msr 7c15040
[    0.000000] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 32137
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: console=ttyS0 reboot=k panic=1 pci=off  root=/dev/vda virtio_mmio.device=4K@0xd0000000:5
[    0.000000] PID hash table entries: 512 (order: 0, 4096 bytes)
[    0.000000] Memory: 111064K/130680K available (8204K kernel code, 622K rwdata, 1464K rodata, 1268K init, 2820K bss, 19616K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] Kernel/User page tables isolation: enabled
[    0.004000] Hierarchical RCU implementation.
[    0.004000] 	RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=1.
[    0.004000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[    0.004000] NR_IRQS: 4352, nr_irqs: 48, preallocated irqs: 16
[    0.004000] Console: colour dummy device 80x25
[    0.004000] console [ttyS0] enabled
[    0.004000] tsc: Detected 3292.374 MHz processor
[    0.004000] Calibrating delay loop (skipped) preset value.. 6584.74 BogoMIPS (lpj=13169496)
[    0.004000] pid_max: default: 32768 minimum: 301
[    0.004000] Security Framework initialized
[    0.004000] SELinux:  Initializing.
[    0.004000] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.004000] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.004000] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.004000] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.004305] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[    0.004782] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[    0.005304] Spectre V2 : Mitigation: Full generic retpoline
[    0.005731] Spectre V2 : Spectre v2 mitigation: Enabling Indirect Branch Prediction Barrier
[    0.006366] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    0.006897] Speculative Store Bypass: Vulnerable
[    0.017865] Freeing SMP alternatives memory: 28K
[    0.019147] smpboot: Max logical packages: 1
[    0.019641] x2apic enabled
[    0.020004] Switched APIC routing to physical x2apic.
[    0.021133] ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
[    0.021657] smpboot: CPU0: Intel(R) Xeon(R) Processor @ 3.30GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[    0.022471] Performance Events: unsupported p6 CPU model 60 no PMU driver, software events only.
[    0.023198] Hierarchical SRCU implementation.
[    0.023817] smp: Bringing up secondary CPUs ...
[    0.024000] smp: Brought up 1 node, 1 CPU
[    0.024000] smpboot: Total of 1 processors activated (6584.74 BogoMIPS)
[    0.024000] devtmpfs: initialized
[    0.024000] x86/mm: Memory block size: 128MB
[    0.024132] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.024868] futex hash table entries: 256 (order: 2, 16384 bytes)
[    0.025448] NET: Registered protocol family 16
[    0.025875] cpuidle: using governor ladder
[    0.026166] cpuidle: using governor menu
[    0.029273] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.029803] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.030604] dmi: Firmware registration failed.
[    0.031027] NetLabel: Initializing
[    0.031295] NetLabel:  domain hash size = 128
[    0.031640] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    0.032025] NetLabel:  unlabeled traffic allowed by default
[    0.032560] clocksource: Switched to clocksource kvm-clock
[    0.033008] VFS: Disk quotas dquot_6.6.0
[    0.033325] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.035661] NET: Registered protocol family 2
[    0.035661] TCP established hash table entries: 1024 (order: 1, 8192 bytes)
[    0.036171] TCP bind hash table entries: 1024 (order: 2, 16384 bytes)
[    0.036680] TCP: Hash tables configured (established 1024 bind 1024)
[    0.037219] UDP hash table entries: 256 (order: 1, 8192 bytes)
[    0.037675] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes)
[    0.038184] NET: Registered protocol family 1
[    0.039042] virtio-mmio: Registering device virtio-mmio.0 at 0xd0000000-0xd0000fff, IRQ 5.
[    0.039713] platform rtc_cmos: registered platform RTC device (no PNP device found)
[    0.040738] Scanning for low memory corruption every 60 seconds
[    0.041341] audit: initializing netlink subsys (disabled)
[    0.041933] Initialise system trusted keyrings
[    0.042285] Key type blacklist registered
[    0.042642] audit: type=2000 audit(1543422234.398:1): state=initialized audit_enabled=0 res=1
[    0.043343] workingset: timestamp_bits=36 max_order=15 bucket_order=0
[    0.044790] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.047096] Key type asymmetric registered
[    0.047423] Asymmetric key parser 'x509' registered
[    0.047821] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
[    0.048478] io scheduler noop registered (default)
[    0.048891] io scheduler cfq registered
[    0.049242] virtio-mmio virtio-mmio.0: Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.
[    0.050171] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
[    0.071606] serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a U6_16550A
[    0.073622] loop: module loaded
[    0.074310] tun: Universal TUN/TAP device driver, 1.6
[    0.074776] hidraw: raw HID events driver (C) Jiri Kosina
[    0.075240] nf_conntrack version 0.5.0 (1024 buckets, 4096 max)
[    0.075780] ip_tables: (C) 2000-2006 Netfilter Core Team
[    0.076244] Initializing XFRM netlink socket
[    0.076634] NET: Registered protocol family 10
[    0.077341] Segment Routing with IPv6
[    0.077647] NET: Registered protocol family 17
[    0.078001] Bridge firewalling registered
[    0.078350] sched_clock: Marking stable (76197330, 0)->(119073617, -42876287)
[    0.079048] registered taskstats version 1
[    0.079376] Loading compiled-in X.509 certificates
[    0.080370] Loaded X.509 cert 'Build time autogenerated kernel key: 3472798b31ba23b86c1c5c7236c9c91723ae5ee9'
[    0.081151] zswap: default zpool zbud not available
[    0.081524] zswap: pool creation failed
[    0.081912] Key type encrypted registered
[    0.083270] EXT4-fs (vda): recovery complete
[    0.083638] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
[    0.084251] VFS: Mounted root (ext4 filesystem) on device 254:0.
[    0.084830] devtmpfs: mounted
[    0.085606] Freeing unused kernel memory: 1268K
[    0.092056] Write protecting the kernel read-only data: 12288k
[    0.093373] Freeing unused kernel memory: 2016K
[    0.094823] Freeing unused kernel memory: 584K
2018-11-29T00:23:54.488369418 [:WARN:vmm/src/lib.rs:903] Guest-boot-time = 168891 us 168 ms, 162411 CPU us 162 CPU ms
OpenRC init version 0.35.5.87b1ff59c1 starting
Starting sysinit runlevel

   OpenRC 0.35.5.87b1ff59c1 is starting up Linux 4.14.55-84.37.amzn2.x86_64 (x86_64)

 * Mounting /proc ...
 [ ok ]
 * Mounting /run ...
 * /run/openrc: creating directory
 * /run/lock: creating directory
 * /run/lock: correcting owner
 * Caching service dependencies ...
Service `hwdrivers' needs non existent service `dev'
 [ ok ]
Starting boot runlevel
 * Remounting devtmpfs on /dev ...
 [ ok ]
 * Mounting /dev/mqueue ...
 [ ok ]
 * Mounting /dev/pts ...
 [ ok ]
 * Mounting /dev/shm ...
 [ ok ]
 * Setting hostname ...
 [ ok ]
 * Checking local filesystems  ...
 [ ok ]
 * Remounting filesystems ...
 [ ok[    0.202513] random: fast init done
 ]
 * Mounting local filesystems ...
 [ ok ]
 * Loading modules ...
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
 [ ok ]
 * Mounting misc binary format filesystem ...
 [ ok ]
 * Mounting /sys ...
 [ ok ]
 * Mounting security filesystem ...
 [ ok ]
 * Mounting debug filesystem ...
 [ ok ]
 * Mounting SELinux filesystem ...
 [ ok ]
 * Mounting persistent storage (pstore) filesystem ...
 [ ok ]
Starting default runlevel
 * Starting networking ...
 *   eth0 ...
Device "eth0" does not exist.
ifconfig: eth0: error fetching interface information: Device not found
ifconfig: SIOCSIFADDR: No such device
run-parts: /etc/network/if-up.d/firecracker-tap: exit status 1            [ !! ]
 *   eth1 ...
Device "eth1" does not exist.
ifconfig: eth1: error fetching interface information: Device not found
ifconfig: SIOCSIFADDR: No such device
run-parts: /etc/network/if-up.d/firecracker-tap: exit status 1            [ !! ]
 *   eth2 ...
Device "eth2" does not exist.
ifconfig: eth2: error fetching interface information: Device not found
ifconfig: SIOCSIFADDR: No such device
run-parts: /etc/network/if-up.d/firecracker-tap: exit status 1            [ !! ]
 * ERROR: networking failed to start
[    0.311853] random: sshd: uninitialized urandom read (40 bytes read)
 * Starting sshd.eth0 ...
[    0.316682] random: sshd: uninitialized urandom read (40 bytes read)   [ ok ]
[    1.056063] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2f75287227e, max_idle_ns: 440795361488 ns

Welcome to Alpine Linux 3.8
Kernel 4.14.55-84.37.amzn2.x86_64 on an x86_64 (ttyS0)

localhost login: root
Password:
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

login[980]: root login on 'ttyS0'
localhost:~# ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 {openrc-init} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [kworker/0:0]
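The log above contains a Guest-boot-time line written by Firecracker itself (168 ms in this run). If you save the console output to a file, a sketch like this pulls the millisecond figure out; the function name is made up, and the pattern assumes the line format shown in this log, which may differ in other versions.

```shell
# Print the guest boot time in milliseconds from console output read on stdin.
# guest_boot_ms is a hypothetical helper matched to the v0.11.0 log format above.
guest_boot_ms() {
    sed -n 's/.*Guest-boot-time = [0-9]* us \([0-9]*\) ms.*/\1/p'
}
```

For example: `guest_boot_ms < boot.log`, where boot.log is wherever you saved the output.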