GIT 5ed6627ee96f0a6802d99e71879d98610ba17e01 git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git#master

commit
Author: Laurent Vivier
Date: Tue Sep 25 13:36:40 2007 +0200

KVM: x86 emulator: Any legacy prefix after a REX prefix nullifies its effect

This patch modifies the management of the REX prefix according to the behavior I saw in Xen 3.1. In Xen, this modification was introduced by Jan Beulich.
http://lists.xensource.com/archives/html/xen-changelog/2007-01/msg00081.html

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 6972c9253725255034d0f8d83f5bdbf70430a95b
Author: Laurent Vivier
Date: Mon Sep 24 17:00:58 2007 +0200

KVM: Purify x86_decode_insn() error case management

The only valid case is a protected page access; the other cases are errors.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 2cd2d1d11d1f67c4d660c0cf6758dd6e588c4dd6
Author: Qing He
Date: Mon Sep 24 17:39:41 2007 +0800

KVM: apic round robin cleanup

If no apic is enabled in the bitmap of an interrupt delivery with delivery mode of lowest priority, a warning should be reported rather than selecting a fallback vcpu.

Signed-off-by: Qing He
Signed-off-by: Eddie (Yaozu) Dong
Signed-off-by: Avi Kivity

commit 04817088a0e8d96587e4fb954d104d62f71df58d
Author: Qing He
Date: Mon Sep 24 17:22:13 2007 +0800

KVM: x86_emulator: no writeback for bt

Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit a21855c2ed30a7a01468558bfc12a05722ef3771
Author: Laurent Vivier
Date: Mon Sep 24 11:10:56 2007 +0200

KVM: x86 emulator: Remove no_wb, use dst.type = OP_NONE instead

Remove no_wb and use dst.type = OP_NONE instead, an idea stolen from xen-3.1.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit d7f0f98414e3ab5259d54aa6ebd86a825af76980
Author: Laurent Vivier
Date: Mon Sep 24 11:10:55 2007 +0200

KVM: x86 emulator: remove _eflags and use ctxt->eflags directly

Remove _eflags and use ctxt->eflags directly. Caching eflags is not needed, as it is restored to the vcpu by kvm_main.c:emulate_instruction() from ctxt->eflags only if emulation doesn't fail.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit d98df34cc539942d8d5540ffa2425ca91056a7d3
Author: Laurent Vivier
Date: Mon Sep 24 11:10:54 2007 +0200

KVM: x86 emulator: split some decoding into functions for readability

To improve readability, move the push, writeback, and grp 1a/2/3/4/5/9 emulation parts into functions.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 62d1ea7fdcdb905072198e4cec8f724c8ad33092
Author: Ryan Harper
Date: Tue Sep 18 14:05:16 2007 -0500

KVM: MMU: Ignore reserved bits in cr3 in non-pae mode

This patch removes the fault injected when the guest attempts to set reserved bits in cr3. X86 hardware doesn't generate a fault when setting reserved bits. The result of this patch is that vmware-server, running within a kvm guest, boots and runs memtest from an iso.

Signed-off-by: Ryan Harper
Signed-off-by: Avi Kivity

commit 4acc535e64696fb09da6d2f41a5a8b8f60739c03
Author: Avi Kivity
Date: Sun Sep 23 14:10:49 2007 +0200

KVM: MMU: Make flooding detection work when guest page faults are bypassed

When we allow guest page faults to reach the guests directly, we lose the fault tracking which allows us to detect demand paging. So we provide an alternate mechanism by clearing the accessed bit when we set a pte, and checking it later to see if the guest actually used it.

Signed-off-by: Avi Kivity
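As a rough illustration of the accessed-bit technique just described (a minimal sketch, not kvm's actual code; the single-pte stand-in and both helpers are invented):

    #include <stdbool.h>
    #include <stdint.h>

    #define PT_ACCESSED_MASK (1ULL << 5)    /* x86 pte accessed bit */

    static uint64_t shadow_pte;             /* stand-in for one shadow pte slot */

    /* Install a pte with the accessed bit cleared, so the CPU will set
     * it the first time the guest touches the page. */
    static void set_spte(uint64_t pte)
    {
            shadow_pte = pte & ~PT_ACCESSED_MASK;
    }

    /* Later: if the accessed bit is still clear, the guest never used
     * the page since we installed the pte, so the fault pattern looks
     * like flooding rather than demand paging. */
    static bool guest_used_page(void)
    {
            return (shadow_pte & PT_ACCESSED_MASK) != 0;
    }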
commit 30649900566e8e8785b814f4a40e6660d8086873
Author: Avi Kivity
Date: Sun Sep 16 18:58:32 2007 +0200

KVM: Allow not-present guest page faults to bypass kvm

There are two classes of page faults trapped by kvm:
- host page faults, where the fault is needed to allow kvm to install the shadow pte or update the guest accessed and dirty bits
- guest page faults, where the guest has faulted and kvm simply injects the fault back into the guest to handle

The second class, guest page faults, is pure overhead. We can eliminate some of it on vmx using the following evil trick:
- when we set up a shadow page table entry, if the corresponding guest pte is not present, set up the shadow pte as not present
- if the guest pte _is_ present, mark the shadow pte as present but also set one of the reserved bits in the shadow pte
- tell the vmx hardware not to trap faults which have the present bit clear

With this, normal page-not-present faults go directly to the guest, bypassing kvm entirely. Unfortunately, this trick only works on Intel hardware, as AMD lacks a way to discriminate among page faults based on error code. It is also a little risky since it uses reserved bits which might become unreserved in the future, so a module parameter is provided to disable it.

Signed-off-by: Avi Kivity

commit afa232aeb1676c63c5c4000a6d865cdc9455b2b5
Author: Izik Eidus
Date: Sun Sep 23 12:30:19 2007 +0200

KVM: MMU: Set shadow pte atomically in mmu_pte_write_zap_pte()

Shadow page table entries should be set atomically, using set_shadow_pte().

Signed-off-by: Izik Eidus
Signed-off-by: Avi Kivity

commit 99f6c824362215f3038cfe54ddcd3c940281e9cd
Author: Avi Kivity
Date: Fri Sep 21 05:29:13 2007 +0200

KVM: Fix ioapic edge-triggered interrupts

- clear irr after service
- do not service after unmasking; wait for a new edge

Signed-off-by: Avi Kivity

commit 5fdd2a196e7975d446fedf6973cbb20708f1359c
Author: Avi Kivity
Date: Thu Sep 20 12:27:28 2007 +0200

KVM: Fix host oops due to guest changing efer

If the guest changes efer from long mode with sce disabled to legacy mode, then load_transition_efer() zeros vmx->host_state.guest_efer_loaded, but the SCE-disabled efer remains in effect. So when we return to the host, we disable SCE and syscalls no longer work.

Fix by (a) not touching vmx->host_state.guest_efer_loaded if we're not setting it, and instead (b) clearing it explicitly when we switch back. Also switch back when the guest writes to efer, so we start from a clean slate.

Signed-off-by: Avi Kivity

commit e8ebaa91f96407a90c1cb81708a87a25f40ba8ab
Author: Laurent Vivier
Date: Thu Sep 20 11:17:24 2007 +0200

KVM: x86 emulator: fix repne/repnz decoding

The repnz/repne instructions must set rep_prefix to 1 like rep/repe/repz. This patch corrects the disk probe problem seen with OpenBSD. The issue appears with commit 091b206f6c56f2329e11bac2fa40d6f236ab0bc2 because, before it, the decoding was done internally by kvm; after it, the decoding is done by x86_emulate.c (which didn't do it correctly).

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 0203e2d5d0d0cea6eed6e437d9456aad71135913
Author: Avi Kivity
Date: Wed Sep 19 16:08:53 2007 +0200

KVM: Implement ioapic irq polarity bit

Reverse the sense of the irq level if the polarity bit is set.

Signed-off-by: Avi Kivity

commit faba110779451794f764a4802e740146e8efb93f
Author: Avi Kivity
Date: Wed Sep 19 15:48:58 2007 +0200

KVM: Avoid redelivery of edge-triggered irq if it is already in service

Noticed by Eddie Dong.
Signed-off-by: Avi Kivity

commit 3ddc321087b0083fec2eff1bc613410fdc2e8388
Author: Laurent Vivier
Date: Wed Sep 19 13:41:55 2007 +0200

KVM: x86 emulator: Correct inconsistency between cr2 and ctxt->cr2

This patch corrects an inconsistency of cr2 introduced by the x86 emulator split.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 37d266e7330baedc52fab5cc699cff2c8cc2947e
Author: Nitin A Kamble
Date: Tue Sep 18 16:34:25 2007 -0700

KVM: x86 emulator: fix merge screwup due to emulator split

This code has gone to the wrong place in the file. Moving it back to the right location.

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 1196dd4e2e6f09053335b0a91e3cc793808c00a7
Author: Avi Kivity
Date: Wed Sep 19 10:52:18 2007 +0200

KVM: Fix ioapic.c compilation failure due to missing include

Signed-off-by: Avi Kivity

commit 44a0469583ff93240acb76085a993a1d30202679
Author: Avi Kivity
Date: Wed Sep 19 10:44:58 2007 +0200

KVM: VMX: Fix build on i386 due to EFER_LMA not defined

commit caba4b5c24f24bf003dd385e5658f0b682bdf0ac
Author: Avi Kivity
Date: Wed Aug 29 03:48:05 2007 +0300

KVM: VMX: Further reduce efer reloads

KVM avoids reloading the efer msr when the difference between the guest and host values consists of the long mode bits (which are switched by hardware) and the NX bit (which is emulated by the KVM MMU). This patch also allows KVM to ignore SCE (syscall enable) when the guest is running in 32-bit mode. This is because the syscall instruction is not available in 32-bit mode on Intel processors, so the SCE bit is effectively meaningless.

Signed-off-by: Avi Kivity

commit 97594d420a09db38e3f2c8aa2c8481dc51c11e82
Author: Avi Kivity
Date: Tue Sep 18 15:26:30 2007 +0200

KVM: Fix #UD exception delivery

It doesn't have an error code, and it uses the #UD vector.

Signed-off-by: Avi Kivity

commit 5e7a195fc4b1c0df577658e01a25b91d49bc68ee
Author: Avi Kivity
Date: Tue Sep 18 14:19:00 2007 +0200

KVM: Fix ioapic level-triggered interrupt redelivery

The ioapic failed to set the irr bit if a previous interrupt was already being serviced. This caused interrupt loss fairly soon, leading to loss of level-triggered devices like pci networking. This patch fixes the problem by always setting irr when an irq is asserted.

Signed-off-by: Avi Kivity

commit 5d9b36eec8ca6abe03da91efdfc7b5861525bd43
Author: Laurent Vivier
Date: Tue Sep 18 11:27:37 2007 +0200

KVM: Call x86_decode_insn() only when needed

Move emulate_ctxt to kvm_vcpu to keep the emulate context when we exit from the kvm module. Call x86_decode_insn() only when needed. Modify x86_emulate_insn() to not modify the context if it must be re-entered.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit a00840cfcc2c18662e04ac94fcbe12266c403cad
Author: Laurent Vivier
Date: Tue Sep 18 11:27:27 2007 +0200

KVM: emulate_instruction() now calls x86_decode_insn() and x86_emulate_insn()

emulate_instruction() now calls x86_decode_insn() and x86_emulate_insn(). x86_emulate_insn() is x86_emulate_memop() without the decoding part.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit a40bf553563276cf3aff293b6ec36373bf3dc968
Author: Laurent Vivier
Date: Tue Sep 18 11:27:19 2007 +0200

KVM: x86 emulator: move all decoding process to function x86_decode_insn()

Split the decoding process into a new function x86_decode_insn().

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity
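The three commits above converge on a two-phase emulator. A sketch of the resulting control flow (prototypes simplified and bodies stubbed; this is not the real kvm code):

    struct x86_emulate_ctxt { int dummy; };   /* stand-in for the real context */

    /* Phase 1: fill the decode cache from the guest instruction stream. */
    static int x86_decode_insn(struct x86_emulate_ctxt *ctxt) { return 0; }

    /* Phase 2: act on the already-decoded instruction. */
    static int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) { return 0; }

    static int emulate_instruction(struct x86_emulate_ctxt *ctxt, int need_decode)
    {
            int rc = 0;

            /* Decode only on first entry; a re-entered emulation (e.g.
             * after an mmio exit to userspace) reuses the cached decode
             * state instead of decoding again. */
            if (need_decode)
                    rc = x86_decode_insn(ctxt);
            if (rc == 0)
                    rc = x86_emulate_insn(ctxt);
            return rc;
    }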
commit c18617e89f3a94fd74d55dde36b54c8ca23072f9
Author: Laurent Vivier
Date: Tue Sep 18 11:52:50 2007 +0200

KVM: x86 emulator: move all x86_emulate_memop() to a structure

Move all x86_emulate_memop() variables common to decode and execute into a structure decode_cache. This will help in later separating decode and emulate.

struct decode_cache {
	u8 twobyte;
	u8 b;
	u8 lock_prefix;
	u8 rep_prefix;
	u8 op_bytes;
	u8 ad_bytes;
	struct operand src;
	struct operand dst;
	unsigned long *override_base;
	unsigned int d;
	unsigned long regs[NR_VCPU_REGS];
	unsigned long eip;
	/* modrm */
	u8 modrm;
	u8 modrm_mod;
	u8 modrm_reg;
	u8 modrm_rm;
	u8 use_modrm_ea;
	unsigned long modrm_ea;
	unsigned long modrm_val;
};

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit d6a754e5ec1ae429e7bd22a2b54e0fea1d64e1d9
Author: Laurent Vivier
Date: Tue Sep 18 11:26:38 2007 +0200

KVM: x86 emulator: remove unused functions

Remove #ifdef'd functions that are never used.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 8db6e3d54971e76019b02a6ad860c9df35218dfc
Author: Anthony Liguori
Date: Mon Sep 17 14:57:50 2007 -0500

KVM: Refactor hypercall infrastructure (v3)

This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture, and replacing it with the right one lazily.

A fall-out of this patch is that unhandled hypercalls no longer trap to userspace. There is very little reason, though, to use a hypercall to communicate with userspace, as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls.

Signed-off-by: Anthony Liguori
Signed-off-by: Avi Kivity

commit cd132dabf169ce61b5be58d42d9bd08984cba429
Author: Anthony Liguori
Date: Mon Sep 17 14:57:49 2007 -0500

KVM: x86 emulator: Add vmmcall/vmcall to x86_emulate (v3)

Add vmmcall/vmcall to x86_emulate. A future patch will implement the functionality for these instructions.

Signed-off-by: Anthony Liguori
Signed-off-by: Avi Kivity

commit 17f668f73876cb0a67404db12b843850a9426bbb
Author: Avi Kivity
Date: Mon Sep 17 11:02:51 2007 +0200

KVM: Fix virtualization menu help text

What guest drivers?

Signed-off-by: Avi Kivity

commit 23263a086e85f0065f95e3cb676dd96434da98d8
Author: Avi Kivity
Date: Mon Sep 17 10:59:31 2007 +0200

KVM: Remove errant printk() in kvm_vcpu_ioctl_get_sregs()

Signed-off-by: Avi Kivity

commit 718a3339a903ea1935148eb095c2f8ce741a54a2
Author: Avi Kivity
Date: Mon Sep 17 10:58:27 2007 +0200

KVM: Fix kvm_vcpu_ioctl_get_sregs() warning on i386

Signed-off-by: Avi Kivity

commit 37fa44d29fb9d9c47126e40bfe266f8bf74de43d
Author: Qing He
Date: Mon Sep 17 14:47:13 2007 +0800

KVM: fix PIC interrupt delivery on different APIC conditions

This patch changes the PIC interrupt delivery. Now it is only delivered to vcpu0 when either condition is met (on vcpu0):
1. the local APIC is hardware disabled
2. LVT0 is unmasked and configured to delivery mode ExtInt

It fixes the 2x-faster wall clock on x86_64 and SMP i386 Linux guests.

Signed-off-by: Qing He
Signed-off-by: Avi Kivity
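The two delivery conditions above translate into a simple predicate; a sketch with invented helper names (not kvm's code):

    #include <stdbool.h>

    #define APIC_MODE_EXTINT 0x7            /* LVT delivery mode for ExtInt */

    struct lapic_state {
            bool hw_enabled;                /* APIC enable bit in IA32_APIC_BASE */
            bool lvt0_masked;
            int  lvt0_deliv_mode;
    };

    /* vcpu0 takes PIC interrupts iff its local APIC is hardware-disabled,
     * or LVT0 is unmasked and programmed for ExtInt delivery. */
    static bool vcpu0_accepts_pic_irq(const struct lapic_state *apic)
    {
            return !apic->hw_enabled ||
                   (!apic->lvt0_masked &&
                    apic->lvt0_deliv_mode == APIC_MODE_EXTINT);
    }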
commit c408e4e8d9045d53c1d82c622a5756febd051ef9
Author: Avi Kivity
Date: Sat Sep 15 17:34:36 2007 +0300

KVM: Skip pio instruction when it is emulated, not executed

If we defer updating rip until pio instructions are executed, we have a problem with reset: a pio reset updates rip, and when the instruction completes we skip the emulated instruction, pointing rip somewhere completely unrelated. Fix by updating rip when we decode the instruction, not after emulation.

Signed-off-by: Avi Kivity

commit 340bcebdee0382c3b1dd9f963e21e4217594467b
Author: Nitin A Kamble
Date: Sat Sep 15 10:45:05 2007 +0300

KVM: x86 emulator: popf

Implement emulation of instruction: popf (opcode 0x9d).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 338be073d760fde05ee2f78ace4e9576dc3e6909
Author: Nitin A Kamble
Date: Sat Sep 15 10:43:33 2007 +0300

KVM: x86 emulator: fix src, dst value initialization

Some operand fetches are less than the machine word size and can result in stale bits if used together with operands of different sizes.

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit a5f988993b5b167d007c8c33d45ad8e0f849d22a
Author: Nitin A Kamble
Date: Sat Sep 15 10:41:26 2007 +0300

KVM: x86 emulator: jmp abs

Implement emulation of instruction: jump absolute r/m (opcode 0xff /4).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit c6aeb4632919c37317213c9b5a41bedbcc8b3416
Author: Nitin A Kamble
Date: Sat Sep 15 10:35:36 2007 +0300

KVM: x86 emulator: lea

Implement emulation of instruction: lea r16/r32, m (opcode 0x8d).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit c9aa71c7901df4fcf1eb33721bea8b581188f2bf
Author: Nitin A Kamble
Date: Sat Sep 15 10:25:41 2007 +0300

KVM: X86 emulator: jump conditional short

Implement emulation of more jump conditional instructions: jcc shortrel (opcodes 0x70 - 0x7f).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit ef70803677eec7ab37d61e48b0d1cb628c3f991b
Author: Nitin A Kamble
Date: Sat Sep 15 10:23:07 2007 +0300

KVM: x86 emulator: implement jump conditional relative

Implement emulation of instruction: jump conditional rel (opcodes 0x0f 0x80 - 0x0f 0x8f).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 021854e6f9566e2990c1dfee474d4d509f84e3fd
Author: Nitin A Kamble
Date: Sat Sep 15 10:13:07 2007 +0300

KVM: x86 emulator: sort opcodes into ascending order

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 28621bdce24324e1f8b33fa25595cc0609153be6
Author: Sheng Yang
Date: Wed Sep 12 18:03:11 2007 +0800

KVM: VMX: Prevent setting CPU_BASED_TPR_SHADOW on i386 host

Although the tpr shadow feature can be used on an i386 host, it needs support from the virtual apic access feature, which hasn't been implemented yet; otherwise it will cause trouble on i386 machines that support it. This patch disables the tpr shadow feature on i386 hosts for now.

Signed-off-by: Sheng Yang
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 58d8159c7264eee015ad0656afd018aecbb3c69f
Author: Avi Kivity
Date: Wed Sep 12 13:21:09 2007 +0300

KVM: Improve emulation failure reporting

Report failed opcodes from all locations.
Signed-off-by: Avi Kivity

commit 0fe149eb04e5e67f4d3ebc2ab9f2426356a308ba
Author: Nitin A Kamble
Date: Tue Aug 28 18:22:47 2007 -0700

KVM: x86 emulator: pushf

Implement emulation of instruction: pushf (opcode 0x9c).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 917fca3116605cfa2db01d390ba94b0412c88eb3
Author: Nitin A Kamble
Date: Tue Aug 28 18:08:37 2007 -0700

KVM: x86 emulator: call near

Implement emulation of instruction: call (near) (opcode 0xe8).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 2b15dbf376872112c832576cc3b3f618e0b85e2d
Author: Nitin A Kamble
Date: Tue Aug 28 17:58:52 2007 -0700

KVM: x86 emulator: push imm8

Implement the instruction: push imm8 (opcode 0x6a).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit c7c2eaa2c305ed106da78afd7b95f42cec8d6dc8
Author: He, Qing
Date: Wed Sep 12 14:18:28 2007 +0800

KVM: VMX: Fix exit qualification width on i386

According to the Intel Software Developer's Manual, Vol. 3B, Appendix H.4.2, the exit qualification should be of natural width. However, the current code uses u64 as the data type for this register, which occasionally introduces invalid values into the VMExit handling logic. This patch fixes this bug. I have tested Windows and Linux guests on an i386 host, and they boot successfully with this patch.

Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit de60b339983ae64920b1bc58bb5c2c6b10db5d93
Author: Eddie Dong
Date: Wed Sep 12 10:58:04 2007 +0300

KVM: Fix link error to "genapic"

GET_APIC_ID may use the genapic instance for some machine configurations on the i386 architecture, but it is not exported for outside usage. This patch removes this reference.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 0663a73e366dd1df52c0ce4fec32f47455575324
Author: Avi Kivity
Date: Mon Sep 10 18:10:54 2007 +0300

KVM: Move main vcpu loop into subarch independent code

This simplifies adding new code as well as reducing overall code size.

Signed-off-by: Avi Kivity

commit ec1ff57323b7ce5022cae99e27ff8297ce2aaa27
Author: Avi Kivity
Date: Mon Sep 10 17:27:03 2007 +0300

KVM: VMX: Move vm entry failure handling to the exit handler

This will help moving the main loop to subarch independent code.

Signed-off-by: Avi Kivity

commit 2f6ad5e1fb93c0392e6cfaf2ef2bee3aaaa19244
Author: Avi Kivity
Date: Mon Sep 10 15:20:59 2007 +0300

KVM: Remove smp_processor_id() in kvm_vcpu_kick()

The value is meaningless since it can change; instead, call the function unconditionally. It is a no-op on the same cpu anyway. This removes an annoying runtime warning.

Signed-off-by: Avi Kivity

commit a40fa4c30f0883a3a4a1560e0174540d6594e0ca
Author: Avi Kivity
Date: Mon Sep 10 11:28:17 2007 +0300

KVM: MMU: Don't do GFP_NOWAIT allocations

Before preemption notifiers, kvm needed to allocate memory with GFP_NOWAIT so as not to have to enable preemption and take a heavyweight exit. On oom, we'd fall back to a GFP_KERNEL allocation.

With preemption notifiers, we can do a GFP_KERNEL allocation, and perform the heavyweight exit only if the kernel decides to put us to sleep.

Signed-off-by: Avi Kivity

commit 5b25a47c1edb6ba9ac23e745260e7533be371c1d
Author: He, Qing
Date: Mon Sep 10 11:01:52 2007 +0300

KVM: fix apic timer migration when inactive

When the local apic timer is inactive, or has expired in one-shot mode, it should not be restarted on vcpu and hrtimer migration. This patch fixes this.
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit e44af0f4ee99974ce40102e23784bc3cae7f4466
Author: Jindrich Makovicka
Date: Sun Sep 9 18:45:01 2007 +0300

KVM: Fix lapic 64-bit division on 32-bit hosts

Signed-off-by: Avi Kivity

commit c54c215e7e71b99c0a3270d7fc85668179bea67a
Author: Christian Ehrhardt
Date: Sun Sep 9 15:41:59 2007 +0300

KVM: Rename kvm_arch_ops to kvm_x86_ops

This patch just renames the current (misnamed) _arch names to _x86 to ensure better readability when a real arch layer takes place.

Signed-off-by: Christian Ehrhardt
Signed-off-by: Avi Kivity

commit beec957bd8205ebbd9dc2eecb166fe4ae06e31e4
Author: Laurent Vivier
Date: Thu Aug 30 14:56:21 2007 +0200

KVM: Simplify memory allocation

The mutex->spinlock conversion allows us to make some code simplifications. As we can keep the lock longer, we don't have to release it and then check whether the environment has been modified before re-taking it. We can remove kvm->busy and kvm->memory_config_version.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit f71c862171a7265085798d1aa8c43eadb6d85520
Author: Rusty Russell
Date: Thu Sep 6 01:21:32 2007 +1000

KVM: Hoist SVM's get_cs_db_l_bits into core code

SVM gets the DB and L bits for the cs by decoding the segment. This is in fact completely generic code, so hoist it for kvm-lite to use.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 1cfd09dd0492b50376ed703f4252c489d91d1597
Author: Rusty Russell
Date: Thu Sep 6 01:20:38 2007 +1000

KVM: Keep control regs in sync

We don't update the vcpu control registers in various places. We should do so.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit aa38840d3d2e0a804e628077df8d8879b496d741
Author: Rusty Russell
Date: Sun Sep 9 14:12:54 2007 +0300

KVM: Clean up unloved invlpg emulation

invlpg shouldn't fetch the "src" address, since it may not be valid; however, SVM's "solution", which neuters emulation of all group 7 instructions, is horrible and breaks kvm-lite. The simplest fix is to put a special check in for invlpg.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 8fa7178a4f0c96662bab31ff46e3bff1995ff14a
Author: Rusty Russell
Date: Sun Sep 9 14:10:57 2007 +0300

KVM: Remove the unused invlpg member of struct kvm_arch_ops

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit f4ed61146e11e6f42500385feee51eb30ed324d0
Author: Amit Shah
Date: Sat Aug 25 11:35:52 2007 +0300

KVM: Set the ET flag in CR0 after initializing FX

This was missed when moving stuff around in fbc4f2e. Fixes Solaris guests and bug #1773613.

Signed-off-by: Amit Shah
Signed-off-by: Avi Kivity

commit fe3f479c1a1564b3612a955f037a1905b85c9f6f
Author: He, Qing
Date: Mon Sep 3 17:07:41 2007 +0300

KVM: enable in-kernel APIC INIT/SIPI handling

This patch enables INIT/SIPI handling using the in-kernel APIC, by introducing a ->mp_state field to emulate the SMP state transition.

Signed-off-by: Qing He
Signed-off-by: Xin Li
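A sketch of the kind of ->mp_state machine this describes (the enum values and helper names here are illustrative, not the exact kvm definitions):

    /* INIT moves an uninitialized AP to INIT_RECEIVED; a following SIPI
     * supplies the real-mode start vector and makes the vcpu runnable. */
    enum vcpu_mp_state {
            MP_STATE_RUNNABLE,
            MP_STATE_UNINITIALIZED,    /* AP waiting for INIT */
            MP_STATE_INIT_RECEIVED,    /* INIT seen, waiting for SIPI */
            MP_STATE_SIPI_RECEIVED,    /* SIPI seen, start fetching code */
    };

    static void apic_handle_init(enum vcpu_mp_state *mp_state)
    {
            if (*mp_state == MP_STATE_UNINITIALIZED)
                    *mp_state = MP_STATE_INIT_RECEIVED;
    }

    static void apic_handle_sipi(enum vcpu_mp_state *mp_state)
    {
            if (*mp_state == MP_STATE_INIT_RECEIVED)
                    *mp_state = MP_STATE_SIPI_RECEIVED;
    }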
commit a0c1343ffdac844fe659678928d1eb6c88e8aeb4
Author: He, Qing
Date: Mon Sep 3 17:01:36 2007 +0300

KVM: round robin for APIC lowest priority delivery mode

Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit ca7d5e3ddce0d1483fbb28ba59d7677c8935d785
Author: Eddie Dong
Date: Mon Sep 3 17:00:24 2007 +0300

KVM: deliver PIC interrupt only to vcpu0

Signed-off-by: Eddie (Yaozu) Dong
Signed-off-by: Avi Kivity

commit 1fef0a7c83cc8ce89c6ea25225898da57ea68a63
Author: Eddie Dong
Date: Mon Sep 3 16:56:58 2007 +0300

KVM: VMX: Fix tpr threshold updating

The TPR threshold must be updated only after any irqs are injected.

Signed-off-by: Eddie (Yaozu) Dong
Signed-off-by: Avi Kivity

commit 43803341a9873c3955f10352b30f5d449aae70b5
Author: He, Qing
Date: Thu Aug 30 17:04:26 2007 +0800

KVM: disable tpr/cr8 sync when in-kernel APIC is used

Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 966b840d0d7f52fdf2062772a1477f4d2536ab8f
Author: Eddie Dong
Date: Mon Sep 3 16:15:12 2007 +0300

KVM: Migrate lapic hrtimer when vcpu moves to another cpu

This reduces the overhead of accessing cachelines from the wrong node, as well as simplifying locking.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 50587b4ba6352cb87212b581f3e6a4b21ee5ff7f
Author: Eddie Dong
Date: Sat Aug 25 18:00:41 2007 +0300

KVM: Keep track of missed timer irq injections

The APIC timer IRQ is set every time a certain period expires in host time, but the guest may be descheduled at that moment, and the irq then gets overwritten by a later firing. This patch keeps track of the number of fired irqs and decreases the count only when an IRQ is injected into the guest or buffered in the APIC.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 09fe5ff51331cc089d9d02b33ef0fc512dcd8f69
Author: Yang, Sheng
Date: Mon Sep 3 16:37:44 2007 +0300

KVM: VMX: Use shadow TPR/cr8 for 64-bit guests

This patch enables the TPR shadow of VMX on CR8 accesses. 64-bit Windows accesses the TPR frequently via CR8. The TPR shadow improves the performance of TPR accesses by not causing a vmexit.

Signed-off-by: Sheng Yang
Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit b2954a203243fa837c8c867163b00d2dee278048
Author: Eddie Dong
Date: Mon Aug 6 16:29:07 2007 +0300

KVM: pending irq save/restore

Add in-kernel irqchip save/restore support for pending vectors.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 6546954a502dedf79401c7e5f564573df78e2f61
Author: Eddie Dong
Date: Thu Sep 6 12:22:56 2007 +0300

KVM: in-kernel LAPIC save and restore support

This patch adds a new vcpu-based IOCTL to save and restore the local apic registers for a single vcpu. The kernel only copies the apic page as a whole; extraction of registers is left to the userspace side. On restore, the APIC timer is restarted from the initial count; this introduces a little delay, but works fine.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 3ae9777fe4c0fdc9688a8976ed105f537b5e2aea
Author: He, Qing
Date: Sun Aug 5 10:49:16 2007 +0300

KVM: in-kernel IOAPIC save and restore support

This patch adds support for in-kernel ioapic save and restore (to and from userspace). It uses the same get/set_irqchip ioctl as the in-kernel PIC.

Signed-off-by: Qing He
Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity
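For reference, the userspace side of this save/restore path as it exists in today's KVM API (the exact struct layout at the time of these commits may have differed):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <stdio.h>

    static int save_ioapic(int vm_fd, struct kvm_irqchip *chip)
    {
            chip->chip_id = KVM_IRQCHIP_IOAPIC;          /* select the ioapic */
            if (ioctl(vm_fd, KVM_GET_IRQCHIP, chip) < 0) {
                    perror("KVM_GET_IRQCHIP");
                    return -1;
            }
            return 0;                                    /* chip->chip.ioapic filled */
    }

    static int restore_ioapic(int vm_fd, struct kvm_irqchip *chip)
    {
            chip->chip_id = KVM_IRQCHIP_IOAPIC;
            return ioctl(vm_fd, KVM_SET_IRQCHIP, chip);  /* write the state back */
    }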
commit 8002f6c199d9c8535ee7848f55c35d012433e75a
Author: He, Qing
Date: Thu Aug 2 14:03:07 2007 +0300

KVM: Bypass irq_pending get/set when using in kernel irqchip

vcpu->irq_pending is saved in the get/set_sregs IOCTLs, but when the in-kernel local APIC is used, doing this may occasionally overwrite vcpu->apic with an invalid value, as in the vm restore path.

Signed-off-by: Qing He

commit aae5fefba1de58a016d5c49c92c79a58ed989721
Author: He, Qing
Date: Thu Jul 26 11:05:18 2007 +0300

KVM: Add get/set irqchip ioctls for in-kernel PIC live migration support

This patch adds two new ioctls to dump and write kernel irqchips for save/restore and live migration. PIC save/restore and live migration are implemented in this patch.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 6e031384c47b825dd70a6c412f214d3379ed1f83
Author: Eddie Dong
Date: Sun Jul 22 10:36:31 2007 +0300

KVM: Protect in-kernel pio using kvm->lock

The pio operation and the IRQ_LINE kvm_vm_ioctl are not protected by kvm->lock. Take the lock, as is done for the IOAPIC MMIO operations.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 38557cfd5156f87f7f75623600b6b8c30e4e1ace
Author: Eddie Dong
Date: Wed Jul 18 12:15:21 2007 +0300

KVM: Emulate hlt in the kernel

By sleeping in the kernel when hlt is executed, we simplify the in-kernel guest interrupt path considerably.

Signed-off-by: Gregory Haskins
Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 1fd2279b67369669c95e4474a9a1e0b0b6fbb060
Author: Eddie Dong
Date: Wed Jul 18 12:03:39 2007 +0300

KVM: In-kernel I/O APIC model

This allows in-kernel host-side device drivers to raise guest interrupts without going to userspace.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit d69018f178ceb0cbb93fdb4795b1b503c6899162
Author: Eddie Dong
Date: Thu Sep 6 12:22:56 2007 +0300

KVM: Emulate local APIC in kernel

Because lightweight exits (exits which don't involve userspace) are many times faster than heavyweight exits, it makes sense to emulate high-usage devices in the kernel. The local APIC is one such device, especially for Windows and for SMP, so we add an APIC model to kvm. It also allows in-kernel host-side drivers to inject interrupts without going through userspace.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 7ae590c3ab0dab8a6e94965ccc2e77f9c6309a8d
Author: Eddie Dong
Date: Wed Jul 18 11:34:57 2007 +0300

KVM: Define and use cr8 access functions

This patch wraps the APIC base register and CR8 operations, which can provide a unified API for the user-level irqchip and the kernel irqchip. This is preparation for merging the lapic/ioapic patch.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 127992c791d57fa7646e5ee8de60360ea3c0bd59
Author: Eddie Dong
Date: Fri Jul 6 12:20:49 2007 +0300

KVM: Add support for in-kernel PIC emulation

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 263773f7a6606efda85f7b184a067b5f560ed39b
Author: Laurent Vivier
Date: Thu Aug 23 16:33:11 2007 +0200

KVM: VMX: Split segments reload in vmx_load_host_state()

vmx_load_host_state() bundles fs, gs, ldt, and tss reloading into one in the hope that it is infrequent. With smp guests, fs reloading is frequent due to fs being used by threads. Unbundle the reloads to reduce expensive gs reloads.
Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit f3e0aa2b4593e7d5dd064a3c56c919f85ef0d9eb
Author: Avi Kivity
Date: Wed Aug 22 18:09:29 2007 +0300

KVM: X86 emulator: fix 'push reg' writeback

Pointed out by Rusty Russell.

Signed-off-by: Avi Kivity

commit 46a948d80db725870da4ebdf5893d8efc426446d
Author: Izik Eidus
Date: Mon Aug 20 18:11:00 2007 +0300

KVM: Support more memory slots

Needed for mapping memory at 4GB.

Signed-off-by: Izik Eidus
Signed-off-by: Avi Kivity

commit a88bbc1699461bab7479fdcef3ea1c12069acd1f
Author: Avi Kivity
Date: Mon Aug 20 15:44:39 2007 +0300

KVM: MMU: Fix rare oops on guest context switch

A guest context switch to an uncached cr3 can require allocation of shadow pages, but we only recycle shadow pages in kvm_mmu_page_fault(). Move shadow page recycling to mmu_topup_memory_caches(), which is called from both the page fault handler and from guest cr3 reload.

Signed-off-by: Avi Kivity

commit feeb915ce6cd7a5f51b2e56b6ff8dffb959a9594
Author: Avi Kivity
Date: Sun Aug 19 13:51:00 2007 +0300

KVM: Avoid calling smp_call_function_single() with interrupts disabled

When taking a cpu down, we need to hardware_disable() it. Unfortunately, the CPU_DYING notifier is called with interrupts disabled, which means we can't use smp_call_function_single(). Fortunately, the CPU_DYING notifier is always called on the dying cpu, so we don't need to use the function at all and can simply call hardware_disable() directly.

Tested-by: Paolo Ornati
Signed-off-by: Avi Kivity

commit 086f2ee50db8a1f39b0e17ab17d9c79b5964f0d7
Author: Izik Eidus
Date: Sun Aug 19 22:24:58 2007 +0300

KVM: VMX: allow rmode_tss_base() to work with >2G of guest memory

Signed-off-by: Izik Eidus
Signed-off-by: Avi Kivity

commit a843332b0445c9d60e4c9bda965b10cbe632a088
Author: Nitin A Kamble
Date: Sun Aug 19 11:07:06 2007 +0300

KVM: x86 emulator: implement 'push reg' (opcodes 0x50-0x57)

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit b23f94e52c4cec9e5ad404ac1426c49d64902dbf
Author: Nitin A Kamble
Date: Sun Aug 19 11:03:13 2007 +0300

KVM: x86 emulator: Implement 'jmp rel short' instruction (opcode 0xeb)

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 1bee2245738d473f57dcc973e16a497ff800c026
Author: Nitin A Kamble
Date: Sun Aug 19 11:00:36 2007 +0300

KVM: x86 emulator: implement 'jmp rel' instruction (opcode 0xe9)

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 1757c2eb9f4a6b3b3aac1026dd15f26eb5dea041
Author: Nitin A Kamble
Date: Fri Aug 17 15:17:41 2007 +0300

KVM: x86 emulator: implement 'and $imm, %{al|ax|eax}'

Implement emulation of instruction: and al imm8 (opcode 0x24), and ax/eax imm16/imm32 (opcode 0x25).

Signed-off-by: Nitin A Kamble
Signed-off-by: Avi Kivity

commit 9bcbb7df435e154b148662df3721f71847a9342c
Author: Sheng Yang
Date: Thu Aug 16 13:01:00 2007 +0300

KVM: Communicate cr8 changes to userspace

This allows running 64-bit Windows.

Signed-off-by: Sheng Yang
Signed-off-by: Avi Kivity

commit 93d097821cce141b3c74bbf20735c6dde443715f
Author: Avi Kivity
Date: Wed Aug 15 15:23:34 2007 +0300

KVM: Close minor race in signal handling

We need to check for signals inside the critical section; otherwise a signal can be sent and we will not notice. Also move the check before entry, so that if the signal arrives before the first entry, we exit immediately instead of waiting for something to happen to the guest.

Signed-off-by: Avi Kivity
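A sketch of the entry-path pattern that last commit describes: test for pending signals with interrupts disabled, before committing to guest entry. Simplified flow, not the actual kvm run loop; enter_guest_mode() is a hypothetical helper:

    #include <linux/sched.h>      /* signal_pending(), current */
    #include <linux/irqflags.h>   /* local_irq_disable()/enable() */
    #include <linux/errno.h>

    struct kvm_vcpu;                                      /* opaque here */
    static void enter_guest_mode(struct kvm_vcpu *vcpu);  /* hypothetical */

    static int vcpu_try_enter_guest(struct kvm_vcpu *vcpu)
    {
            local_irq_disable();             /* open the critical section */

            if (signal_pending(current)) {   /* signal arrived before entry: */
                    local_irq_enable();      /* exit to userspace immediately */
                    return -EINTR;           /* rather than running the guest */
            }

            enter_guest_mode(vcpu);          /* guest runs until the next exit */
            local_irq_enable();
            return 0;
    }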
commit 83aecfbf44f3ba92abde47957a3c9175f1ec7165
Author: Glauber de Oliveira Costa
Date: Wed Aug 15 05:36:45 2007 +0300

KVM: VMX: Don't require cr8 load/store exit capability when running on 32-bit

This is because cr8 is not available on IA-32; it is only used in 64-bit mode. The rdmsr will then report this capability as not present, and it will lead us to return an -EIO.

Signed-off-by: Glauber de Oliveira Costa
Signed-off-by: Avi Kivity

commit 7a3773c7d8a0b488e86b98571e5b858a222b12a5
Author: Laurent Vivier
Date: Sun Aug 5 10:43:32 2007 +0300

KVM: Clean up kvm_setup_pio()

Split kvm_setup_pio() into two functions: one to set up in/out pio (kvm_emulate_pio()) and one to set up ins/outs pio (kvm_emulate_pio_string()).

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 091b206f6c56f2329e11bac2fa40d6f236ab0bc2
Author: Laurent Vivier
Date: Sun Aug 5 10:36:40 2007 +0300

KVM: Cleanup string I/O instruction emulation

Both vmx and svm decode the I/O instructions, and both botch the job, requiring the instruction prefixes to be fetched in order to completely decode the instruction. So, if we see a string I/O instruction, use the x86 emulator to decode it, as it already has all the prefix decoding machinery.

This patch defines ins/outs opcodes in x86_emulate.c and calls emulate_instruction() from io_interception() (svm.c) and from handle_io() (vmx.c). It removes all vmx/svm prefix instruction decoders (get_addr_size(), io_get_override(), io_address(), get_io_count()).

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 3415130f97a18f354853cab694d392553aa51af8
Author: Laurent Vivier
Date: Wed Aug 1 21:51:09 2007 +0300

KVM: Remove useless assignment

Line 1809 of kvm_main.c is useless; the value is overwritten in line 1815:

1809  now = min(count, PAGE_SIZE / size);
1810
1811  if (!down)
1812          in_page = PAGE_SIZE - offset_in_page(address);
1813  else
1814          in_page = offset_in_page(address) + size;
1815  now = min(count, (unsigned long)in_page / size);
1816  if (!now) {

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 777d55128214491c48efa4b88355c3fa38c3b057
Author: Li, Xin B
Date: Wed Aug 1 21:49:10 2007 +0300

KVM: VMX: Remove a duplicated ia32e mode vm entry control

Remove a duplicated ia32e mode VM Entry control definition and use the proper one.

Signed-off-by: Xin Li
Signed-off-by: Avi Kivity

commit b6d8a8dd56ee037b64af90085dd4bd54cbf16ac5
Author: Rusty Russell
Date: Wed Aug 1 14:46:11 2007 +1000

KVM: Use kmem_cache_free for kmem_cache_zalloc'ed objects

We use kfree in svm.c and vmx.c, and this works, but it could break at any time. kfree() is supposed to match up with kmalloc().

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit c4a026d15eecb4ec9769210aa31d7992f2b87c74
Author: Rusty Russell
Date: Wed Aug 1 10:48:02 2007 +1000

KVM: Add and use pr_unimpl for standard formatting of unimplemented features

All guest-invokable printks should be ratelimited to prevent malicious guests from flooding logs. This is a start.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 07b7ac315dfb33f8e56a3c19572a96318b8cbc43
Author: Rusty Russell
Date: Wed Aug 1 10:17:06 2007 +1000

KVM: Remove unneeded kvm_dev_open and kvm_dev_release functions

Devices don't need open or release functions.
Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 71cda75733ec3020a5ce57a31eaf300f007c67b2
Author: Rusty Russell
Date: Wed Aug 1 10:12:22 2007 +1000

KVM: Remove stat_set from debugfs

We shouldn't define stat_set on the debug attributes, since that will cause silent failure on writing: without a set argument, userspace will get -EACCES.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 6e0bfce30aa37a7fd5dd9c041296dfa237dae728
Author: Gabriel C
Date: Wed Aug 1 16:23:10 2007 +0200

KVM: Fix defined but not used warning in drivers/kvm/vmx.c

move_msr_up() is used only on X86_64 and generates a warning on !X86_64.

Signed-off-by: Gabriel Craciunescu
Signed-off-by: Avi Kivity

commit e203ad4bcf11981df6fc1677fedbdb29f6fa38e8
Author: Rusty Russell
Date: Tue Jul 31 20:46:12 2007 +1000

KVM: Remove redundant alloc_vmcs_cpu declaration

alloc_vmcs_cpu is already declared (static) above; no need to redeclare.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 8716bbed1f90ec805ca20a0c3264e181278c08cd
Author: Rusty Russell
Date: Tue Jul 31 20:42:42 2007 +1000

KVM: SVM: Make set_msr_interception more reliable

set_msr_interception() is used by svm to set up which MSRs should be intercepted. It can only fail if someone has changed the code to try to intercept an MSR without updating the array of ranges. The return value is ignored anyway: it should just BUG() if it doesn't work. (A build-time failure would be better, but that's tricky.)

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit a3573510c9b6a93fffaa118e58494d439c37a17a
Author: Rusty Russell
Date: Tue Jul 31 20:41:14 2007 +1000

KVM: Cleanup mark_page_dirty

For some reason, mark_page_dirty open-codes __gfn_to_memslot().

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 087ba994ef1267032319ff2ec2d8addb8bc5a567
Author: Rusty Russell
Date: Tue Jul 31 20:45:03 2007 +1000

KVM: Don't assign vcpu->cr3 if it's invalid: check first, set last

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit c53b35b292e58cf234aa7ca08fc679e61d4b291b
Author: Yang, Sheng
Date: Tue Jul 31 14:23:01 2007 +0300

KVM: VMX: Add cpu consistency check

All the physical CPUs on the board should support the same VMX feature set. Add check_processor_compatibility to kvm_arch_ops for the consistency check.

Signed-off-by: Sheng Yang
Signed-off-by: Avi Kivity

commit 2e3bac2a9a2d52b6f349296812c5b752249e3e30
Author: Rusty Russell
Date: Tue Jul 31 19:57:47 2007 +1000

KVM: kvm_vm_ioctl_get_dirty_log restore "nothing dirty" optimization

kvm_vm_ioctl_get_dirty_log scans the bitmap to see if it's all zero, but doesn't use that information.

Avi says: Looks like it was used to guard kvm_mmu_slot_remove_write_access(); optimizing the case where the guest just leaves the screen alone (which it usually does, especially in benchmarks). I'd rather reinstate that optimization. See 90cb0529dd230548a7f0d6b315997be854caea1b where the damage was done.

It's pretty simple: if the bitmap is all zero, we don't need to do anything to clean it.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 68fa04ca20fb8cf79e171c37bffd74466f12ad2b
Author: Rusty Russell
Date: Mon Jul 30 21:13:43 2007 +1000

KVM: Use alignment properties of vcpu to simplify FPU ops

Now that we use a kmem cache for allocating vcpus, we can get the 16-byte alignment required by the fxsave & fxrstor instructions, and avoid manually aligning the buffer.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity
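A sketch of getting fxsave-compatible alignment from the slab allocator, using kmem_cache_create()'s alignment argument as in current kernels (the cache name and struct are illustrative, not kvm's):

    #include <linux/slab.h>
    #include <linux/errno.h>

    struct my_vcpu {
            char fx_buf[512];   /* fxsave/fxrstor image: needs 16-byte alignment */
            /* ... */
    };

    static struct kmem_cache *vcpu_cache;

    static int vcpu_cache_init(void)
    {
            /* The third argument is the object alignment: the slab hands
             * out 16-byte-aligned vcpus, so no manual aligning is needed. */
            vcpu_cache = kmem_cache_create("my_vcpu", sizeof(struct my_vcpu),
                                           16, 0, NULL);
            return vcpu_cache ? 0 : -ENOMEM;
    }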
commit 0ce565a6fc253c87f26d51c506cd13554889a598
Author: Rusty Russell
Date: Mon Jul 30 21:12:19 2007 +1000

KVM: Use kmem cache for allocating vcpus

Avi wants the allocations of vcpus centralized again. The easiest way is to add a "size" arg to kvm_init_arch, and expose the thus-prepared cache to the modules.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 29b8a493b293639ae509c44386dc6a8ff79debd0
Author: Laurent Vivier
Date: Mon Jul 30 13:41:19 2007 +0300

KVM: Remove kvm_{read,write}_guest()

... in favor of the more general emulator_{read,write}_*.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 6d2b86f131a3cbf370b4a65f6a6db63081cb6efb
Author: Laurent Vivier
Date: Mon Jul 30 13:35:24 2007 +0300

KVM: Change the emulator_{read,write,cmpxchg}_* functions to take a vcpu

... instead of an x86_emulate_ctxt, so that other callers can use them easily.

Signed-off-by: Laurent Vivier
Signed-off-by: Avi Kivity

commit 80917728e43e248155c019f743655806b582b099
Author: Avi Kivity
Date: Mon Jul 30 15:56:36 2007 +0300

KVM: x86 emulator: disable writeback for debug register instructions

These are handled internally by the instruction.

Signed-off-by: Avi Kivity

commit 1c23728a5acd3a1fe5d628e23e3e4c27ee77118f
Author: Rusty Russell
Date: Mon Jul 30 20:08:05 2007 +1000

KVM: SVM: internal function name cleanup

Changes some svm.c internal function names:
1) io_adress -> io_address (de-germanify the spelling)
2) kvm_reput_irq -> reput_irq (it's not a generic kvm function)
3) kvm_do_inject_irq -> (it's not a generic kvm function)

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 61736efb5398154eceafcce0337fe0621d7eeeb0
Author: Rusty Russell
Date: Mon Jul 30 20:07:08 2007 +1000

KVM: SVM: de-containization

container_of is wonderful, but not casting at all is better. This patch changes svm.c's internal functions to pass "struct vcpu_svm" instead of "struct kvm_vcpu" and using container_of.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit b15c5febefc05f04b5db04552bef18a6902e657c
Author: Rusty Russell
Date: Mon Jul 30 16:41:57 2007 +1000

KVM: Remove three magic numbers

There are several places where hardcoded numbers are used in place of the easily-available constant, which is poor form.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit b21514dab8c88570bf2078249881a8210e50bafa
Author: Rusty Russell
Date: Mon Jul 30 16:31:43 2007 +1000

KVM: VMX: pass vcpu_vmx internally

container_of is wonderful, but not casting at all is better. This patch changes vmx.c's internal functions to pass "struct vcpu_vmx" instead of "struct kvm_vcpu" and using container_of.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 7d3fd03221bb8352a263249e6adb1232064e4341
Author: Rusty Russell
Date: Mon Jul 30 16:29:56 2007 +1000

KVM: fx_init() needs preemption disabled while it plays with the FPU state

Now that kvm generally runs with preemption enabled, we need to protect the fpu initialization sequence.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 985bc8087daf3719d89e5ed28fe59eecd58fae71
Author: Shaohua Li
Date: Mon Jul 23 14:51:37 2007 +0800

KVM: Convert vm lock to a mutex

This allows the kvm mmu to perform sleepy operations, such as memory allocation.
Signed-off-by: Shaohua Li
Signed-off-by: Avi Kivity

commit 8928fb48c7a7f9053a55f1d0023cbc533f2b3663
Author: Avi Kivity
Date: Wed Jul 11 18:17:21 2007 +0300

KVM: Use the scheduler preemption notifiers to make kvm preemptible

Current kvm disables preemption while the new virtualization registers are in use. This of course is not very good for latency-sensitive workloads (one use of virtualization is to offload user interface and other latency-insensitive stuff to a container, so that it is easier to analyze the remaining workload). This patch re-enables preemption for kvm; preemption is now only disabled when switching the registers in and out, and during the switch to guest mode and back.

Contains fixes from Shaohua Li.

Signed-off-by: Avi Kivity

commit 510144c386fb650a5530311721ae9d90bf12eaee
Author: Yang, Sheng
Date: Sun Jul 29 11:07:42 2007 +0300

KVM: VMX: Improve the method of writing vmcs control

Put the cpu feature detection part in hardware_setup, and store the vmcs configuration in a global variable for later checking.

Signed-off-by: Sheng Yang
Signed-off-by: Avi Kivity

commit fbc4f2e23aa26a8537f8f147c75a632e498c39c7
Author: Rusty Russell
Date: Fri Jul 27 17:16:56 2007 +1000

KVM: Dynamically allocate vcpus

This patch converts the vcpus array in "struct kvm" to a pointer array, and changes the "vcpu_create" and "vcpu_setup" hooks into one "vcpu_create" call which does the allocation and initialization of the vcpu (calling back into the kvm_vcpu_init core helper).

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 6532f26b4f39a409475918da47844eaff219f50b
Author: Gregory Haskins
Date: Fri Jul 27 08:13:10 2007 -0400

KVM: Remove arch specific components from the general code

struct kvm_vcpu has vmx-specific members; remove them to a private structure.

Signed-off-by: Gregory Haskins
Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 35b8e2b29b372ab285819c3b84d6db1d0165998b
Author: Rusty Russell
Date: Wed Jul 25 13:29:51 2007 +1000

KVM: load_pdptrs() cleanups

load_pdptrs can be handed an invalid cr3, and it should not oops. This can happen because we injected #gp in set_cr3() after we set vcpu->cr3 to the invalid value, or from kvm_vcpu_ioctl_set_sregs(), or because the memory configuration changed after the guest did set_cr3().

We should also copy the pdpte array once, before checking and assigning; otherwise an SMP guest can potentially alter the values between the check and the set.

Finally, one nitpick: ret = 1 should be done as late as possible; this allows GCC to check for an unset "ret" should the function change in the future.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 9cb698bd020974a7e950eca6285254b50b0b64d5
Author: Aurelien Jarno
Date: Wed Jul 25 11:41:57 2007 +0200

KVM: Remove dead code in the cmpxchg instruction emulation

The writeback fixes (02c03a326a5df825cc01de426f72e160db2b9538) left some dead code in the cmpxchg instruction emulation. Remove it.

Signed-off-by: Aurelien Jarno
Signed-off-by: Avi Kivity

commit d9cbd1d77543d731f31e8ea5d1738d4aad81694a
Author: Sheng Yang
Date: Wed Jul 25 12:17:06 2007 +0300

KVM: VMX: Import some constants of vmcs from IA32 SDM

This patch mainly imports some constants and renames two existing vmcs constants according to the IA32 SDM. It also adds two constants to indicate the Lock bit and the Enable bit in MSR_IA32_FEATURE_CONTROL, and replaces the hardcoded _5_ with these two bits.

Signed-off-by: Sheng Yang
Signed-off-by: Avi Kivity
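For illustration, the two MSR_IA32_FEATURE_CONTROL bits that replace the hardcoded 5 (bits 0 and 2); the macro names here are illustrative, the bit positions are from the SDM:

    #include <stdint.h>

    #define MSR_IA32_FEATURE_CONTROL        0x03a
    #define FEATURE_CONTROL_LOCKED          (1u << 0)  /* MSR locked until reset */
    #define FEATURE_CONTROL_VMXON_ENABLED   (1u << 2)  /* VMXON permitted */

    /* The old check against a literal 5 becomes a readable predicate:
     * locked by the BIOS with VMX left disabled means VMX is unusable. */
    static int vmx_disabled_by_bios(uint64_t msr)
    {
            return (msr & (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED))
                    == FEATURE_CONTROL_LOCKED;
    }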
commit bfa6c62f98bd0602025d7b48e267d817082f5d07
Author: Aurelien Jarno
Date: Wed Jul 25 10:19:54 2007 +0200

KVM: disable writeback for 0x0f 0x01 instructions

0x0f 0x01 instructions (i.e. lgdt, lidt, smsw, lmsw and invlpg) do not use writeback. This patch sets no_wb=1 when emulating those instructions. This fixes a regression booting the FreeBSD kernel on AMD.

Signed-off-by: Aurelien Jarno
Signed-off-by: Avi Kivity

commit 24beb1e24843f05c3acfd20fc2fbcf4f5ab18ec7
Author: Shaohua Li
Date: Mon Jul 23 14:51:39 2007 +0800

KVM: Move gfn_to_page out of kmap/unmap pairs

gfn_to_page might sleep with swap support. Move it out of the kmap calls.

Signed-off-by: Shaohua Li
Signed-off-by: Avi Kivity

commit 33c5dfed96a8cb19ccc2e08073ef97e5c731dae3
Author: Avi Kivity
Date: Wed Jul 25 09:22:12 2007 +0300

KVM: Fix removal of nx capability from guest cpuid

Testing the wrong bit caused kvm not to disable nx on the guest when it is disabled on the host (an mmu optimization relies on the nx bits being the same in the guest and host). This allows Windows to boot when nx is disabled on the host (e.g. when host pae is disabled).

Signed-off-by: Avi Kivity

commit 8d4faaba7b1ac40b96709dc244e7d81058918a08
Author: Shaohua Li
Date: Mon Jul 23 14:51:32 2007 +0800

KVM: Hoist kvm_mmu_reload() out of the critical section

vmx_cpu_run doesn't handle errors correctly and kvm_mmu_reload might sleep with the mutex changes, so I move it above.

Signed-off-by: Shaohua Li
Signed-off-by: Avi Kivity

commit b41e5014dd8712e8de2b656617f7a7a158cd992a
Author: Avi Kivity
Date: Mon Jul 23 18:33:14 2007 +0300

Revert "KVM: Avoid useless memory write when possible"

This reverts commit 8a1449563b3e5ede56b28cc977c8da22a17cdf51. While it does save useless updates, it (probably) defeats the fork detector, causing a massive performance loss.

Signed-off-by: Avi Kivity

commit 4d69bc0c78587849583d63ada004c82dc6277829
Author: Rusty Russell
Date: Mon Jul 23 17:11:02 2007 +1000

KVM: Return if the pdptrs are invalid when the guest turns on PAE.

Don't fall through and turn on PAE in this case.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit e8c2eb98b58dd135b14d87e6dd1d621bc630d919
Author: Rusty Russell
Date: Mon Jul 23 17:08:21 2007 +1000

KVM: Fix unlikely kvm_create vs decache_vcpus_on_cpu race

We add the kvm to the vm_list before initializing the vcpu mutexes, which can be mutex_trylock()'ed by decache_vcpus_on_cpu().

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit aae0954ed6ac2a00ee76fd209aa2a39bb2f43a0c
Author: Avi Kivity
Date: Sun Jul 22 18:48:54 2007 +0300

KVM: Correctly handle writes crossing a page boundary

Writes that are contiguous in virtual memory may not be contiguous in physical memory; so split writes that straddle a page boundary.

Thanks to Aurelien for reporting the bug, patient testing, and a fix to this very patch.

Signed-off-by: Aurelien Jarno
Signed-off-by: Avi Kivity

commit 76f0301b5e4d2603d8e1ee5295db29faea660b49
Author: Avi Kivity
Date: Sun Jul 22 15:51:58 2007 +0300

KVM: x86 emulator: fix faulty check for two-byte opcode

Right now, the bug is harmless as we never emulate one-byte 0xb6 or 0xb7. But things may change.

Noted by the mysterious Gabriel C.

Signed-off-by: Avi Kivity

commit 86ba3093d785da1d2d1c5ecbf060d91edd7a5092
Author: Avi Kivity
Date: Sun Jul 22 12:32:57 2007 +0300

KVM: Require CONFIG_ANON_INODES

Found by Sebastian Siewior and randconfig.
Signed-off-by: Avi Kivity

commit 6da018860ce19321e25b685b72f3836d243c2137
Author: Avi Kivity
Date: Sat Jul 21 09:00:21 2007 +0300

KVM: MMU: Fix cleaning up the shadow page allocation cache

__free_page() wants a struct page, not a virtual address.

Signed-off-by: Avi Kivity

commit 29530eb22ba3b0baf260e2767cb125b61151ed25
Author: Avi Kivity
Date: Fri Jul 20 12:30:58 2007 +0300

KVM: x86 emulator: fix cmov for writeback changes

The writeback fixes (02c03a326a5df825cc01de426f72e160db2b9538) broke cmov emulation. Fix.

Signed-off-by: Avi Kivity

commit 92bd26eb2a199716ceeb5604b8f9f5ed7e69ac3d
Author: Avi Kivity
Date: Fri Jul 20 08:18:27 2007 +0300

KVM: MMU: Fix oopses with SLUB

The kvm mmu uses page->private on shadow page tables; so does slub, and an oops results. Fix by allocating regular pages for shadows instead of using slub.

Signed-off-by: Avi Kivity

commit 860852357a6590299a273f1141dbf1871df0b491
Author: Rusty Russell
Date: Tue Jul 17 23:37:17 2007 +1000

KVM: Use standard CR8 flags, and fix TPR definition

The Intel manual (and the KVM definition) say the TPR is 4 bits wide. Also fix the CR8_RESEVED_BITS typo.

Signed-off-by: Rusty Russell
Acked-by: H. Peter Anvin
Signed-off-by: Avi Kivity

commit 56282e5368afbc8ec6eebb6413bbb2ec0733d0ed
Author: Jeff Dike
Date: Tue Jul 17 12:26:59 2007 -0400

KVM: Set exit_reason to KVM_EXIT_MMIO where run->mmio is initialized.

Signed-off-by: Jeff Dike
Signed-off-by: Avi Kivity

commit 7e5437f39897a09e79e69bd0c8d4641f13715cc4
Author: Rusty Russell
Date: Wed Jul 18 13:05:58 2007 +1000

KVM: Trivial: Use standard BITMAP macros, open-code userspace-exposed header

Creating one's own BITMAP macro seems suboptimal: if we use manual arithmetic in the one place exposed to userspace, we can use standard macros elsewhere.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 0dfb860def58bfb2daa000af490ed1986373fea5
Author: Rusty Russell
Date: Tue Jul 17 23:34:16 2007 +1000

Use standard CR4 flags, tighten checking

On this machine (Intel), writing to the CR4 bits 0x00000800 and 0x00001000 causes a GPF. The Intel manual is a little unclear, but AFAICT they're reserved, too. Also fix the spelling of CR4_RESEVED_BITS.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 2aee2b5274884f40475fe9ad6a7f7a3d608e0ea4
Author: Rusty Russell
Date: Tue Jul 17 23:32:55 2007 +1000

Use standard CR3 flags, tighten checking

The kernel now has asm/cpu-features.h: use those macros instead of inventing our own. Also spell out the definition of CR3_RESEVED_BITS, fix its spelling, and tighten it for the non-PAE case.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 688e14654b3ffb0292a209c052e7579948b17f27
Author: Rusty Russell
Date: Tue Jul 17 23:19:08 2007 +1000

KVM: Trivial: Use standard CR0 flags macros from asm/cpu-features.h

The kernel now has asm/cpu-features.h: use those macros instead of inventing our own. Also spell out the definition of CR0_RESEVED_BITS (no code change) and fix the typo.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 0da5e37f4dc3df7a941ddba8863b289863e8dd40
Author: Rusty Russell
Date: Tue Jul 17 23:17:55 2007 +1000

KVM: Trivial: Avoid hardware_disable predeclaration

Don't pre-declare hardware_disable: shuffle the reboot hook down.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 24356bfad9c4b8ba70920153aec00e78698ccb9a
Author: Rusty Russell
Date: Tue Jul 17 23:16:56 2007 +1000

KVM: Trivial: Comment spelling may escape grep

Speling error in comment.
Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 793551cce1b90fac232e0a38269247815fb0d02a
Author: Rusty Russell
Date: Tue Jul 17 23:16:11 2007 +1000

KVM: Trivial: Make decode_register() static

I have shied away from touching x86_emulate.c (it could definitely use some love, but it is forked from the Xen code, and it would be more productive to cross-merge fixes).

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 53df15a3cae92d4528dc8de21132bed3aa929ca1
Author: Rusty Russell
Date: Tue Jul 17 23:15:29 2007 +1000

KVM: Trivial: Remove unused struct cpu_user_regs declaration

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit a9531af471c86779d28ba973cf5f54f82cfbdb8d
Author: Rusty Russell
Date: Tue Jul 17 23:12:26 2007 +1000

KVM: Trivial: /dev/kvm interface is no longer experimental.

The KVM interface is no longer experimental.

Signed-off-by: Rusty Russell
Signed-off-by: Avi Kivity

commit 817d90b391f6c51d07bf9d6a94778a5957d46f65
Author: Avi Kivity
Date: Tue Jul 17 14:20:30 2007 +0300

KVM: x86 emulator: implement rdmsr and wrmsr

Allow real-mode emulation of rdmsr and wrmsr. This allows smp Windows to boot, presumably for its sipi trampoline.

Signed-off-by: Avi Kivity

commit 66d8a4e4d4bd470216028daabb9d887b73259c96
Author: Avi Kivity
Date: Tue Jul 17 13:04:56 2007 +0300

KVM: Fix memory slot management functions for guest smp

The memory slot management functions were oriented against vcpu 0, where they should be kvm-wide. This causes hangs starting X on guest smp.

Fix by making the functions (and the resultant tail in the mmu) non-vcpu-specific. Unfortunately this reduces the efficiency of the mmu object cache a bit. We may have to revisit this later.

Signed-off-by: Avi Kivity

commit 4dd0d9a876db49da29185c868cbea6c77c09c600
Author: Eddie Dong
Date: Tue Jul 17 11:52:33 2007 +0300

KVM: In-kernel string pio write support

Add string pio write support, to support some versions of Windows.

Signed-off-by: Yaozu (Eddie) Dong
Signed-off-by: Avi Kivity

commit 7bb566d5c8661a179106579978c0c606e7fa8a93
Author: Avi Kivity
Date: Tue Jul 17 11:45:55 2007 +0300

KVM: Future-proof the exit information union ABI

Note that as the size of struct kvm_run is not part of the ABI, we can add things at the end.

Signed-off-by: Avi Kivity

commit f2973ff11f9f8ef4b90413cea9cedd7f20639e3e
Author: Jeff Dike
Date: Mon Jul 16 15:24:47 2007 -0400

KVM - add hypercall nr to kvm_run

Add the hypercall number to kvm_run and initialize it. This changes the ABI, but as this particular ABI was unusable before this, no users are affected.

Signed-off-by: Jeff Dike
Signed-off-by: Avi Kivity

commit 973ae594c1a65936fc09acab412be51d97b703b9
Author: Qing He
Date: Thu Jul 12 12:33:56 2007 +0300

KVM: SMP: Add vcpu_id field in struct vcpu

This patch adds a `vcpu_id' field in `struct vcpu', so we can differentiate the BSP and APs without pointer comparison or arithmetic.

Signed-off-by: Qing He
Signed-off-by: Avi Kivity

commit 9f5aa99d6256aa14b64683283ba1c4be910bc67e
Author: Nguyen Anh Quynh
Date: Wed Jul 11 14:30:54 2007 +0300

KVM: Fix *nopage() in kvm_main.c

*nopage() in kvm_main.c should only store the type of mmap() fault if the pointers are not NULL. This patch fixes the problem.

Signed-off-by: Nguyen Anh Quynh
Signed-off-by: Avi Kivity

commit 6287464e41b2b520d78d417f3d1b37aca9202a04
Author: Avi Kivity
Date: Tue Jul 10 17:50:55 2007 +0300

KVM: MMU: Store nx bit for large page shadows

We need to distinguish between large page shadows which have the nx bit set and those which don't.
The problem shows up when booting a newer smp Linux kernel, where the trampoline page (which is in real mode, which uses the same shadow pages as large pages) is using the same mapping as a kernel data page, which is mapped using nx, causing kvm to spin on that page.

Signed-off-by: Avi Kivity

commit a737ba627a98f2ae66c308148c9c967c73f13f5d
Author: Avi Kivity
Date: Thu May 24 13:11:41 2007 +0300

KVM: Use CPU_DYING for disabling virtualization

Only at the CPU_DYING stage can we be sure that no user process will be scheduled onto the cpu and oops when trying to use virtualization extensions.

Signed-off-by: Avi Kivity

commit 4fba051d7ec9ec1961f477d9a20311d8432738b7
Author: Avi Kivity
Date: Thu May 24 13:09:41 2007 +0300

KVM: Tune hotplug/suspend IPIs

The hotplug IPIs can be called from the cpu on which we are currently running, so use on_cpu(). Similarly, drop on_each_cpu() for the suspend/resume callbacks, as we're in atomic context here and only one cpu is up anyway.

Signed-off-by: Avi Kivity

commit 63e8e638342401a5fd04ec310c5d0695c645e444
Author: Avi Kivity
Date: Thu May 24 13:03:52 2007 +0300

KVM: Keep track of which cpus have virtualization enabled

By keeping track of which cpus have virtualization enabled, we prevent double-enable or double-disable during hotplug, which is a very fatal oops.

Signed-off-by: Avi Kivity

commit 9b6f4dedfeb83190b6196fe201e2f33c97de1c73
Author: Avi Kivity
Date: Thu May 24 12:42:10 2007 +0300

SMP: Implement on_cpu()

This defines on_cpu(), which is similar to smp_call_function_single() except that it works if cpu happens to be the current cpu. Can also be seen as a complement to on_each_cpu() (which also doesn't treat the current cpu specially).

Signed-off-by: Avi Kivity

commit 55971a0f3faab6ecdce1e17dafc6d968f3236ade
Author: Avi Kivity
Date: Thu May 24 12:37:34 2007 +0300

HOTPLUG: Adapt thermal throttle to CPU_DYING

CPU_DYING is notified in atomic context, so no taking mutexes here.

Signed-off-by: Avi Kivity

commit 529bd39d193eeae66a7c0fc3b12169ea566dc0e5
Author: Avi Kivity
Date: Thu May 24 12:33:15 2007 +0300

HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING

CPU_DYING is called in atomic context, so don't try to take any locks.

Signed-off-by: Avi Kivity

commit 33e6f5c2bd102cb43a1e9ae5fe210b0d5f9ac69f
Author: Avi Kivity
Date: Thu May 24 12:23:10 2007 +0300

HOTPLUG: Add CPU_DYING notifier

KVM wants a notification when a cpu is about to die, so it can disable hardware extensions, but at a time when user processes cannot be scheduled on the cpu, so that it doesn't try to use virtualization extensions after they have been disabled. This adds a CPU_DYING notification. The notification is called in atomic context on the doomed cpu.

Signed-off-by: Avi Kivity

commit 0d9c57e0a7ee426096af3d79114d23e50ed6d42b
Author: Avi Kivity
Date: Sun Jul 8 11:15:32 2007 +0300

KVM: Fix svm availability check miscompile on i386

Signed-off-by: Avi Kivity

commit 222a35d12ad9ef4f4a97da496f0e038e94681d3b
Author: Avi Kivity
Date: Thu Jun 28 14:15:57 2007 -0400

KVM: Clean up #includes

Remove unnecessary ones, and rearrange the remaining ones in the standard order.

Signed-off-by: Avi Kivity

commit 41ac4b23696b12fec15191969bc18da42359861d
Author: Avi Kivity
Date: Thu Jun 28 08:38:16 2007 -0400

KVM: Remove kvmfs in favor of the anonymous inodes source

kvm uses a pseudo filesystem, kvmfs, to generate inodes, a job that the new anonymous inodes source does much better.

Cc: Davide Libenzi
Signed-off-by: Avi Kivity
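A sketch of what replacing a private pseudo-filesystem with the anonymous inode source looks like, using anon_inode_getfd() as it exists in current kernels (the 2007 interface differed slightly):

    #include <linux/anon_inodes.h>
    #include <linux/fcntl.h>
    #include <linux/fs.h>

    static const struct file_operations kvm_vm_fops;  /* ioctl handlers elided */

    /* One call replaces all of kvmfs's mount/inode/file bookkeeping:
     * allocate an fd whose file is backed by an anonymous inode. */
    static int create_vm_fd(void *kvm)
    {
            return anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
    }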
Cc: Davide Libenzi Signed-off-by: Avi Kivity commit cfc329b216bc3e54fe1107e8f714c7b3bc133224 Author: Joerg Roedel Date: Fri Jun 22 12:29:50 2007 +0300 KVM: SVM: Reliably detect if SVM was disabled by BIOS This patch adds an implementation to the svm is_disabled function to detect reliably if the BIOS disabled the SVM feature in the CPU. This fixes the issues with kernel panics when loading the kvm-amd module on machines where SVM is available but disabled. Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit a2a8a256f8d4ff1595900b810fea90e5e5911b6d Author: Avi Kivity Date: Thu Jun 21 11:54:45 2007 +0300 KVM: VMX: Remove unnecessary code in vmx_tlb_flush() A vmexit implicitly flushes the tlb; the code is bogus. Noted by Shaohua Li. Signed-off-by: Avi Kivity commit 37ebbf17fbf71ec261c57c1404ac7c50ade97c13 Author: Shaohua Li Date: Wed Jun 20 17:13:26 2007 +0800 KVM: MMU: Fix wrong tlb flush order Need to flush the tlb after updating a pte, not before. Signed-off-by: Shaohua Li Signed-off-by: Avi Kivity commit 030421334ae91b7f6302a1cfe9c971a8991b4870 Author: Avi Kivity Date: Wed Jun 20 11:20:04 2007 +0300 KVM: VMX: Reinitialize the real-mode tss when entering real mode Protected mode code may have corrupted the real-mode tss, so re-initialize it when switching to real mode. Signed-off-by: Avi Kivity commit 8a1449563b3e5ede56b28cc977c8da22a17cdf51 Author: Luca Tettamanti Date: Tue Jun 19 22:41:38 2007 +0200 KVM: Avoid useless memory write when possible When writing to normal memory and the memory area is unchanged the write can be safely skipped, avoiding the costly kvm_mmu_pte_write. Signed-Off-By: Luca Tettamanti Signed-off-by: Avi Kivity commit ba9c20c048726037664d303362b688759fdf6e9d Author: Luca Tettamanti Date: Tue Jun 19 22:41:20 2007 +0200 KVM: Fix x86 emulator writeback When the old value and new one are the same the emulator skips the write; this is undesirable when the destination is an MMIO area and the write shall be performed regardless of the previous value. This optimization breaks e.g. a Linux guest APIC compiled without X86_GOOD_APIC. Remove the check and perform the writeback stage in the emulation unless it's explicitly disabled (currently push and some two-byte instructions may disable the writeback). Signed-Off-By: Luca Tettamanti Signed-off-by: Avi Kivity commit 8e770bbe8651e8d13e1d09d426657fbed0fe052a Author: Eddie Dong Date: Tue Jun 19 18:05:03 2007 +0300 KVM: Add support for in-kernel pio handlers Useful for the PIC and PIT. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Avi Kivity commit ecd01fac443e69a574cb064d44e78ff783a1e1a4 Author: Gregory Haskins Date: Thu May 31 14:08:58 2007 -0400 KVM: VMX: Fix interrupt checking on lightweight exit With kernel-injected interrupts, we need to check for interrupts on lightweight exits too. Signed-off-by: Gregory Haskins Signed-off-by: Avi Kivity commit af93971fab7729229a45ecd64c72f56421bbcd0f Author: Gregory Haskins Date: Thu May 31 14:08:53 2007 -0400 KVM: Adds support for in-kernel mmio handlers Signed-off-by: Gregory Haskins Signed-off-by: Avi Kivity commit e0d1fb847d117124da53145b2d9b7f4d3da8e82c Author: Nitin A Kamble Date: Tue Jun 19 11:21:15 2007 +0300 KVM: Implement emulation of instruction "ret" (opcode 0xc3) Signed-off-by: Nitin A Kamble Signed-off-by: Avi Kivity commit 246e9cd14121973b3c653b990d80bcd1c2163dd5 Author: Nitin A Kamble Date: Tue Jun 19 11:16:04 2007 +0300 KVM: Implement emulation of "pop reg" instruction (opcode 0x58-0x5f) For use in real mode.
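Luca Tettamanti's two patches above draw a subtle line: skipping an unchanged write is a valid optimization for plain guest memory, but wrong in the emulator's generic writeback stage, where the destination may be MMIO and the device must observe every access. A hedged sketch of where the check is safe (helper names are illustrative, not the actual KVM functions, and kvm_read_guest() is assumed to return the number of bytes copied):

    /* Safe placement: the ordinary-memory write path only. */
    static int write_unless_unchanged(struct kvm_vcpu *vcpu, gpa_t gpa,
                                      const void *new, unsigned bytes)
    {
        u8 old[8];      /* assumes bytes <= 8, as for pte writes */

        if (bytes <= sizeof(old) &&
            kvm_read_guest(vcpu, gpa, bytes, old) == bytes &&
            !memcmp(old, new, bytes))
            return 0;   /* unchanged: skip the costly kvm_mmu_pte_write */
        return kvm_write_guest(vcpu, gpa, bytes, new);
    }

Placing the same comparison in the emulator writeback stage is exactly the bug the second patch removes.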
Signed-off-by: Nitin A Kamble Signed-off-by: Avi Kivity commit b0c4137315fc6f711fd3a0fc82aedb61a2536ac9 Author: Avi Kivity Date: Sun Jun 17 12:24:23 2007 +0300 KVM: Bring local tree in line with origin Signed-off-by: Avi Kivity commit 6685637b211ad67bdce21bfd9f91bc888b3acb4f Author: Avi Kivity Date: Wed Jun 13 19:55:28 2007 +0300 KVM: VMX: Ensure vcpu time stamp counter is monotonic If the time stamp counter goes backwards, a guest delay loop can become infinite. This can happen if a vcpu is migrated to another cpu, where the counter has a lower value than the first cpu. Since we're doing an IPI to the first cpu anyway, we can use that to pick up the old tsc, and use that to calculate the adjustment we need to make to the tsc offset. Signed-off-by: Avi Kivity commit 8aefa5d7ac55d487af62755545ecc02bc53678af Author: Avi Kivity Date: Wed Jun 13 19:43:19 2007 +0300 KVM: Initialize the BSP bit in the APIC_BASE msr correctly Needs to be set on vcpu 0 only. Signed-off-by: Avi Kivity commit 218179e7978af0308bcbd08f6c43bd5b3607a909 Author: Avi Kivity Date: Tue Jun 12 08:58:13 2007 +0300 KVM: Require a cpu which can set 64-bit values atomically set_64bit() is not available on 80386 and i486. Noticed by Adrian Bunk. Signed-off-by: Avi Kivity commit 74a54c5cfe3a1ea3777964a9e8e7bef119ca549b Author: Shani Moideen Date: Mon Jun 11 09:31:33 2007 +0530 KVM: VMX: Replace memset(<addr>, 0, PAGE_SIZE) with clear_page() Signed-off-by: Shani Moideen Signed-off-by: Avi Kivity commit ff4d2f93a9459aa820b56a59e9dbd3967aa407ce Author: Shani Moideen Date: Mon Jun 11 09:28:26 2007 +0530 KVM: SVM: Replace memset(<addr>, 0, PAGE_SIZE) with clear_page() Signed-off-by: Shani Moideen Signed-off-by: Avi Kivity commit 3105c9a9a2d5f64c9e67745120b8ee5c205847a3 Author: Avi Kivity Date: Thu Jun 7 19:18:30 2007 +0300 KVM: Flush remote tlbs when reducing shadow pte permissions When a vcpu causes a shadow tlb entry to have reduced permissions, it must also clear the tlb on remote vcpus. We do that by: - setting a bit on the vcpu that requests a tlb flush before the next entry - if the vcpu is currently executing, we send an ipi to make sure it exits before we continue Signed-off-by: Avi Kivity commit 2c3ac418d752e7f73ca0d9081a4377278432d565 Author: Avi Kivity Date: Thu Jun 7 19:11:53 2007 +0300 KVM: Keep an upper bound of initialized vcpus That way, we don't need to loop over KVM_MAX_VCPUS for a single vcpu vm. Signed-off-by: Avi Kivity commit 7ca30c3f2efbf9ab5ab595d9bc3e0bd3b705aba1 Author: Avi Kivity Date: Tue Jun 5 16:15:51 2007 +0300 KVM: Emulate hlt on real mode for Intel This has two use cases: when the bios can't boot from disk, and guest smp bootstrap. Signed-off-by: Avi Kivity commit e7ebb74dbacc100cfd621157ac63b95e63e3292d Author: Avi Kivity Date: Tue Jun 5 15:53:05 2007 +0300 KVM: Move duplicate halt handling code into kvm_main.c Will soon have a third user. Signed-off-by: Avi Kivity commit a80408da7a05e0be2ae99ad47dafd4bb4bc847cd Author: Avi Kivity Date: Tue Jun 5 14:37:09 2007 +0300 KVM: Enable guest smp As we don't support guest tlb shootdown yet, this is only reliable for real-mode guests. Signed-off-by: Avi Kivity commit 80b70c068ce4333e5e1242f32f538835a4e5d896 Author: Avi Kivity Date: Tue Jun 5 14:36:10 2007 +0300 KVM: Fix adding an smp virtual machine to the vm list If we add the vm once per vcpu, we corrupt the list if the guest has multiple vcpus.
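The tsc-monotonicity commit above adjusts the VMCS tsc offset during migration; a rough sketch of the arithmetic (function and field names follow vmx conventions, but the details here are assumed, not the committed code):

    /* Called on the new cpu with the last tsc value sampled on the
     * old cpu (picked up via the IPI mentioned in the commit). */
    static void vmx_fixup_tsc_offset(u64 tsc_on_old_cpu)
    {
        u64 tsc_this;

        rdtscll(tsc_this);              /* tsc on the current cpu */
        if (tsc_this < tsc_on_old_cpu) {
            /* counter went backwards: widen the guest's offset so
             * guest rdtsc never observes a decreasing value */
            u64 delta = tsc_on_old_cpu - tsc_this;
            vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
        }
    }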
Signed-off-by: Avi Kivity commit 16fb83998b62717831dca3d913455091c855b3cd Author: Avi Kivity Date: Tue Jun 5 12:17:03 2007 +0300 KVM: Fix vcpu freeing for guest smp A vcpu can pin up to four mmu shadow pages, which means the freeing loop will never terminate. Fix by first unpinning shadow pages on all vcpus, then freeing shadow pages. Signed-off-by: Avi Kivity commit 55ae364d6a882c94511db17e8023c8976d44cd2d Author: Nguyen Anh Quynh Date: Tue Jun 5 10:35:19 2007 +0300 KVM: Remove unnecessary initialization and checks in mark_page_dirty() Signed-off-by: Avi Kivity commit 0ae1aebcc9825fba4d115c197e9c099fd9644caf Author: Robert P. J. Day Date: Sun Jun 3 13:35:29 2007 -0400 KVM: Replace C code with call to ARRAY_SIZE() macro. Signed-off-by: Robert P. J. Day Signed-off-by: Avi Kivity commit 4b82b37a35a085a07d9ed84efee06c69655fd3d1 Author: Avi Kivity Date: Mon Jun 4 15:58:30 2007 +0300 KVM: Lazy guest cr3 switching Switching the guest paging context may require us to allocate memory, which might fail. Instead of wiring up error paths everywhere, make context switching lazy and actually do the switch before the next guest entry, where we can return an error if allocation fails. Signed-off-by: Avi Kivity commit fa8cfb020b0ef0acef94ddc9035b932308840314 Author: Avi Kivity Date: Mon Jun 4 11:11:23 2007 +0300 KVM: VMX: Fix asm constraint "g" can select a memory location, in which case size information is lost and gas needs an instruction suffix. Since the suffix is different for i386 and x86_64, we simply change the constraint to "r". Signed-off-by: Avi Kivity commit 63275ba244275719d6fd4d77c10d6b15586aa727 Author: Avi Kivity Date: Thu May 31 18:28:51 2007 +0300 KVM: MMU: Remove unused large page marker This has not been used for some time, as the same information is available in the page header. Signed-off-by: Avi Kivity commit 21e3670e57c34809d4c141ce1dde4fd8b23a4d60 Author: Avi Kivity Date: Thu May 31 18:24:09 2007 +0300 KVM: MMU: Don't cache guest access bits in the shadow page table This was once used to avoid accessing the guest pte when upgrading the shadow pte from read-only to read-write. But usually we need to set the guest pte dirty or accessed bits anyway, so this wasn't really exploited. Signed-off-by: Avi Kivity commit 319d035ef290b510edb7f848d41098c31ceaace0 Author: Avi Kivity Date: Thu May 31 18:20:14 2007 +0300 KVM: MMU: Simplify accessed/dirty/present/nx bit handling Always set the accessed and dirty bit (since having them cleared causes a read-modify-write cycle), always set the present bit, and copy the nx bit from the guest. Signed-off-by: Avi Kivity commit 080e7fd753ec60140ea89ebb0ea94625ae541534 Author: Avi Kivity Date: Thu May 31 17:17:06 2007 +0300 KVM: MMU: Remove cr0.wp tricks No longer needed as we do everything in one place. Signed-off-by: Avi Kivity commit cc9d465c7a9ef3a109814fa866676f876ff42133 Author: Avi Kivity Date: Thu May 31 15:46:04 2007 +0300 KVM: MMU: Make setting shadow ptes atomic on i386 Signed-off-by: Avi Kivity commit 823c30e8740ad71bd9556f3cd235231ad00bfa55 Author: Avi Kivity Date: Thu May 31 15:23:35 2007 +0300 KVM: Make shadow pte updates atomic With guest smp, a second vcpu might see partial updates when the first vcpu services a page fault. So delay all updates until we have figured out what the pte should look like. Note that on i386, this is still not completely atomic as a 64-bit write will be split into two on a 32-bit machine.
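The atomicity note above is precisely the case set_64bit() covers; the helper the two atomicity commits converge on looks essentially like this (a sketch matching the description, with the name taken from the commit messages rather than copied from the patch):

    /* A shadow pte is 64 bits wide even on i386 (pae format), so a
     * plain store would be split into two 32-bit writes and another
     * vcpu could observe a half-written pte. */
    static void set_shadow_pte(u64 *sptep, u64 spte)
    {
    #ifdef CONFIG_X86_64
        *sptep = spte;                  /* native 64-bit store */
    #else
        set_64bit((unsigned long long *)sptep, spte);
    #endif
    }

This is also why the "Require a cpu which can set 64-bit values atomically" commit drops 80386/i486 support: set_64bit() relies on cmpxchg8b.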
Signed-off-by: Avi Kivity commit b7bd6888968e797f2deaa4aa9f98466a2371392b Author: Avi Kivity Date: Thu May 31 15:14:09 2007 +0300 KVM: Move shadow pte modifications from set_pte/set_pde to set_pde_common() We want all shadow pte modifications in one place. Signed-off-by: Avi Kivity commit b70ccb0b3fd4ac02c0f6cf5153008c736fa27710 Author: Avi Kivity Date: Thu May 31 15:08:29 2007 +0300 KVM: MMU: Fold fix_write_pf() into set_pte_common() This prevents some work from being performed twice, and, more importantly, reduces the number of places where we modify shadow ptes. Signed-off-by: Avi Kivity commit ad5555224aa01b2ddcc45ab9f0172b5497a7cd5d Author: Avi Kivity Date: Thu May 31 11:56:54 2007 +0300 KVM: MMU: Fold fix_read_pf() into set_pte_common() Signed-off-by: Avi Kivity commit 3f1380d422cbd5b9231c3e997e4cbec000e3a08f Author: Avi Kivity Date: Thu May 31 11:45:18 2007 +0300 KVM: MMU: Pass the guest pde to set_pte_common We will need the accessed bit (in addition to the dirty bit) and also write access (for setting the dirty bit) in a future patch. Signed-off-by: Avi Kivity commit 5fe13ee0e2b404dd34dea17ec0849b4a940a5755 Author: Avi Kivity Date: Wed May 30 19:31:17 2007 +0300 KVM: MMU: Move set_pte_common() to pte width dependent code In preparation for some modifications. Signed-off-by: Avi Kivity commit 5ada0f87635fa10a40a22b8b249c3d1fedb79840 Author: Avi Kivity Date: Wed May 30 14:21:51 2007 +0300 KVM: MMU: Simplify fetch() a little bit Signed-off-by: Avi Kivity commit 67310badceaed0519cb8efbe6054d790563ea136 Author: Avi Kivity Date: Wed May 30 12:34:53 2007 +0300 KVM: MMU: Use slab caches for shadow pages and their headers Use slab caches instead of a simple custom list. Signed-off-by: Avi Kivity commit 6d9d80f421f77da043b8b6898e01327763adecd2 Author: Eddie Dong Date: Tue May 29 15:07:21 2007 +0300 KVM: Use symbolic constants instead of magic numbers Signed-off-by: Avi Kivity commit 4eaa906699812e2e28c3237cfedd8c21cbd17c4b Author: Markus Rechberger Date: Sun May 27 10:46:52 2007 +0300 KVM: Fix includes KVM compilation fails for some .configs. This fixes it. Signed-off-by: Markus Rechberger Signed-off-by: Avi Kivity commit d67c455e06a1eaf8ab20b5c4e51f4ae8271b2637 Author: Avi Kivity Date: Thu May 24 11:17:33 2007 +0300 KVM: x86 emulator: implement wbinvd Vista seems to trigger it. Signed-off-by: Avi Kivity commit fc1193d546ec21c279a8e4e3e9eaf999275b2223 Author: Jan Engelhardt Date: Wed May 23 14:22:11 2007 -0700 Use menuconfig objects II - KVM/Virt Make a "menuconfig" out of the Kconfig objects "menu, ..., endmenu", so that the user can disable all the options in that menu at once instead of having to disable each option separately. Signed-off-by: Jan Engelhardt Signed-off-by: Andrew Morton Signed-off-by: Avi Kivity commit a6935dbdaa7278d5e4a4d7478f29462f2a5db7fe Author: Avi Kivity Date: Mon May 21 09:15:47 2007 +0300 KVM: VMX: Remove warnings on i386 Signed-off-by: Avi Kivity commit 1ab29f3fb765b08e65de563d9053d4d05cc95f52 Author: Eddie Dong Date: Mon May 21 07:28:09 2007 +0300 KVM: VMX: Avoid saving and restoring msr_efer on lightweight vmexit MSR_EFER.LME/LMA bits are automatically saved/restored by VMX hardware; KVM only needs to save the NX/SCE bits at the time of a heavyweight VM exit. But clearing the NX bit in the host environment may cause a system hang if the host page tables use EXB bits, so we leave the NX bit as it is. If Host NX=1 and guest NX=0, we can do a guest page table EXB bits check before inserting a shadow pte (though no guest expects to see this kind of gp fault).
If host NX=0, we present no Execute-Disable feature to the guest, so the host NX=0, guest NX=1 combination cannot occur. This patch reduces raw vmexit time by ~27%. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Avi Kivity commit 64ce9a0cf0960f9a029e54d1bffc06123d3b5893 Author: Eddie Dong Date: Sun May 20 16:28:59 2007 +0300 KVM: VMX: Fix a typo which mixes X86_64 and CONFIG_X86_64 This prevents compilation on 64-bit hosts. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Avi Kivity commit cc1d717e078464a049cf8364417ec44267cd6143 Author: Eddie Dong Date: Sun May 20 10:50:08 2007 +0300 KVM: VMX: Cleanup redundant code in MSR set Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Avi Kivity commit 8bf50c5c6b2af81355412ec1696a7e2c8ad940f2 Author: Daniel Hecken Date: Sun May 20 10:32:14 2007 +0300 KVM: VMX: Compile-fix for 32-bit hosts Signed-off-by: Avi Kivity commit f552bf62c86b383dd74030c5830c8043bf41e0bd Author: Eddie Dong Date: Thu May 17 18:55:15 2007 +0300 KVM: VMX: Avoid saving and restoring msrs on lightweight vmexit In a lightweight exit (where we exit and reenter the guest without scheduling or exiting to userspace in between), we don't need various msrs on the host, and avoiding shuffling them around reduces raw exit time by 8%. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Avi Kivity commit 8edb11391b763357734cc5fd293d788d8591e6d7 Author: Nitin A Kamble Date: Thu May 17 15:50:34 2007 +0300 KVM: VMX: Handle #SS faults from real mode Instructions with the address size override prefix (opcode 0x67) cause a #SS fault with error code 0 in VM86 mode. Forward them to the emulator. Signed-Off-By: Nitin A Kamble Signed-off-by: Avi Kivity commit bdf3f418471ba3c65aa78a1943da179d8320fdf8 Author: Avi Kivity Date: Mon May 14 20:41:13 2007 +0300 KVM: VMX: Use local labels in inline assembly This makes oprofile dumps and disassembly easier to read. Signed-off-by: Avi Kivity commit ca76d209b88c344fc6a8eac17057c0088a3d6940 Author: Avi Kivity Date: Sun May 13 20:18:14 2007 +0300 KVM: Remove merge artifact Signed-off-by: Avi Kivity commit 52916bb7c142b5cf8a81da225bf51c2ea60c5b49 Author: Avi Kivity Date: Tue May 8 11:34:07 2007 +0300 KVM: Fix vmx I/O bitmap initialization on highmem systems kunmap() expects a struct page, not a virtual address. Fixes an oops loading kvm-intel.ko on i386 with CONFIG_HIGHMEM. Thanks to Michael Ivanov for reporting. Signed-off-by: Avi Kivity commit facc2faaf471ca539ddd96fdbdf2e147421468a6 Author: Avi Kivity Date: Mon May 7 10:55:37 2007 +0300 KVM: Avoid corrupting tr in real mode The real mode tr needs to be set to a specific tss so that I/O instructions can function. Divert the new tr values to the real mode save area from where they will be restored on transition to protected mode. This fixes some crashes on reboot when the bios accesses an I/O instruction. Signed-off-by: Avi Kivity commit 05eb943c9b547ecc4de850f04ed4c09356440528 Author: Avi Kivity Date: Sun May 6 16:10:01 2007 +0300 KVM: VMX: Only reload guest msrs if they are already loaded If we set an msr via an ioctl() instead of by handling a guest exit, we have the host state loaded, so reloading the msrs would clobber host state instead of guest state. This fixes a host oops (and loss of a cpu) on a guest reboot. Signed-off-by: Avi Kivity commit 242b0f9ae76651226fb42d9ec3ecb1a9d8d7b263 Author: Avi Kivity Date: Sun May 6 15:50:58 2007 +0300 KVM: MMU: Store shadow page tables as kernel virtual addresses, not physical Simplifies things a bit.
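The highmem fix above ("Fix vmx I/O bitmap initialization on highmem systems") is easy to picture; a minimal illustration rather than the committed code:

    struct page *page = alloc_page(GFP_KERNEL);
    void *va = kmap(page);          /* temporary mapping for init */

    memset(va, 0xff, PAGE_SIZE);    /* intercept all I/O ports */
    kunmap(page);                   /* correct: takes the struct page */

The oops came from the moral equivalent of kunmap(va): passing the kmap()ed virtual address where a struct page is expected, which only happens to work when the page is not in highmem.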
Signed-off-by: Avi Kivity commit 03aeb06a4440265777ae4ed62e8431955cbea865 Author: Avi Kivity Date: Sun May 6 15:36:30 2007 +0300 KVM: MMU: Simplify kvm_mmu_free_page() a tiny bit Signed-off-by: Avi Kivity commit f66b4a983d460d68ef5cc392285190065b0617e5 Author: Matthew Gregan Date: Sun May 6 10:59:46 2007 +0300 KVM: Implement IA32_EBL_CR_POWERON msr Attempting to boot the default 'bsd' kernel of OpenBSD 4.1 i386 in a guest fails early in the kernel init inside p3_get_bus_clock while trying to read the IA32_EBL_CR_POWERON MSR. KVM logs an 'unhandled MSR' message and the guest kernel faults. This patch is sufficient to allow OpenBSD to boot, after which it seems to run fine. I'm not sure if this is the correct solution for dealing with this particular MSR, but it works for me. Signed-off-by: Matthew Gregan Signed-off-by: Avi Kivity commit 7a57011a5e7c4082fdfd204115a8212298ef723f Author: Avi Kivity Date: Wed May 2 23:06:22 2007 +0300 KVM: Set cr0.mp for guests This allows fwait instructions to be trapped when the guest fpu is not loaded. Signed-off-by: Avi Kivity commit 90fb720a59dafb11d591a8e53a4a65bfa6fcfea9 Author: Avi Kivity Date: Wed May 2 22:57:13 2007 +0300 KVM: Ensure host cr0.ts is saved Otherwise, host fpu state may be corrupted after an exit. Signed-off-by: Avi Kivity commit 7616f59b208b088afd85d40aa06ca6d4d4a6ca1a Author: Avi Kivity Date: Wed May 2 20:40:00 2007 +0300 KVM: Consolidate guest fpu activation and deactivation Easier to keep track of where the fpu is this way. Signed-off-by: Avi Kivity commit 7ca14868fd7f3c0dc21450e61cca5b77a47daf0d Author: Avi Kivity Date: Wed May 2 17:57:40 2007 +0300 KVM: Rationalize exception bitmap usage Everyone owns a piece of the exception bitmap, but they happily write to the entire thing like there's no tomorrow. Centralize handling in update_exception_bitmap() and have everyone call that. Signed-off-by: Avi Kivity commit de32f820227fbe3e159ec42ce8fd55057155edca Author: Avi Kivity Date: Wed May 2 17:33:43 2007 +0300 KVM: Move some more msr mangling into vmx_save_host_state() Signed-off-by: Avi Kivity commit fa580ecc53536620546659740ae2dfcea763d17c Author: Avi Kivity Date: Wed May 2 17:30:48 2007 +0300 KVM: Prevent guest fpu state from leaking into the host The lazy fpu changes did not take into account that some vmexit handlers can sleep. Move loading the guest state into the inner loop so that it can be reloaded if necessary, and move loading the host state into vmx_vcpu_put() so it can be performed whenever we relinquish the vcpu. Signed-off-by: Avi Kivity commit bc8dcc2107de0ba8f25fc910c4559ebe3df33045 Author: Avi Kivity Date: Wed May 2 16:54:03 2007 +0300 KVM: Fix potential guest state leak into host The lightweight vmexit path avoids saving and reloading certain host state. However, in certain cases lightweight vmexit handling can schedule(), which requires reloading the host state. So we store the host state in the vcpu structure, and reload it when we relinquish the vcpu. Signed-off-by: Avi Kivity commit 11bdaf6e26c0cbabd9b6c8f2e9de60190815d348 Author: Avi Kivity Date: Tue May 1 18:24:38 2007 +0300 KVM: Increase mmu shadow cache to 1024 pages This improves kbuild times by about 10%, bringing it within a respectable 25% of native.
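The "Rationalize exception bitmap usage" commit above funnels every owner of an exception-bitmap bit through one function; a hedged sketch of its shape (the vcpu fields and the set of owners are approximated from the surrounding commits, not copied from the patch):

    static void update_exception_bitmap(struct kvm_vcpu *vcpu)
    {
        u32 eb = 1u << PF_VECTOR;      /* the mmu always wants #PF */

        if (!vcpu->fpu_active)
            eb |= 1u << NM_VECTOR;     /* lazy fpu: trap #NM */
        if (vcpu->rmode.active)
            eb = ~0u;                  /* real mode: trap everything */
        vmcs_write32(EXCEPTION_BITMAP, eb);
    }

Each caller changes its own condition (fpu activation, mode switches) and then calls this, instead of writing the whole field itself.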
Signed-off-by: Avi Kivity commit d6540cdffea466f1ee17a52ef530d40577b476b2 Author: Avi Kivity Date: Tue May 1 16:53:31 2007 +0300 KVM: Update shadow pte on write to guest pte A typical demand page/copy on write pattern is: - page fault on vaddr - kvm propagates fault to guest - guest handles fault, updates pte - kvm traps write, clears shadow pte, resumes guest - guest returns to userspace, re-faults on same vaddr - kvm installs shadow pte, resumes guest - guest continues So, three vmexits for a single guest page fault. But if instead of clearing the page table entry, we update it to correspond to the value that the guest has just written, we eliminate the third vmexit. This patch does exactly that, reducing kbuild time by about 10%. Signed-off-by: Avi Kivity commit 807762acc40f7cc16aefcfaef8a596a4af988b20 Author: Avi Kivity Date: Tue May 1 16:44:05 2007 +0300 KVM: MMU: Respect nonpae pagetable quadrant when zapping ptes When a guest writes to a page that has an mmu shadow, we have to clear the shadow pte corresponding to the memory location touched by the guest. Now, in nonpae mode, a single guest page may have two or four shadow pages (because a nonpae page maps 4MB or 4GB, whereas the pae shadow maps 2MB or 1GB), so when we look up the page we find up to three additional aliases for the page. Since we _clear_ the shadow pte, it doesn't matter except for a slight performance penalty, but if we want to _update_ the shadow pte instead of clearing it, it is vital that we don't modify the aliases. Fortunately, exactly which page is needed (the "quadrant") is easily computed, and is accessible in the shadow page header. All we need is to ignore shadow pages from the wrong quadrants. Signed-off-by: Avi Kivity commit 4a5c1655c9f6df8c668428d3c5d2ad4f67dce08d Author: Avi Kivity Date: Tue May 1 14:16:52 2007 +0300 KVM: Unify kvm_mmu_pre_write() and kvm_mmu_post_write() Instead of calling two functions and repeating expensive checks, call one function and provide it with before/after information. Signed-off-by: Avi Kivity commit ff31cf26ff8e17c2f7164c39dc03fe309ed36506 Author: Avi Kivity Date: Tue May 1 11:32:28 2007 +0300 KVM: Be more careful restoring fs on lightweight vmexit i386 wants fs for accessing the pda even on a lightweight exit, so ensure we can always restore it. This fixes a regression on i386 introduced by the lightweight vmexit patch. Signed-off-by: Avi Kivity commit e6d2f6292194c931b2fa11373a66d640245e1b14 Author: Avi Kivity Date: Mon Apr 30 17:05:38 2007 +0300 KVM: Reduce misfirings of the fork detector The kvm mmu tries to detect forks by looking for repeated writes to a page table. If it sees a fork, it unshadows the page table so the page table copying can proceed at native speed instead of being emulated. However, the detector also triggered on simple demand paging access patterns: a linear walk of memory would of course cause repeated writes to the same pagetable page, causing it to be unshadowed prematurely. Fix by resetting the fork detector if we detect a demand fault. Signed-off-by: Avi Kivity commit f908e27039ab637013ad17c64e4ef77c4c0a24b8 Author: Avi Kivity Date: Mon Apr 30 16:15:58 2007 +0300 KVM: Unindent some code Signed-off-by: Avi Kivity commit 5cf48c367dec74ba8553c53ed332cd075fa38b88 Author: Avi Kivity Date: Mon Apr 30 16:07:54 2007 +0300 KVM: Avoid saving and restoring some host CPU state on lightweight vmexit Many msrs and the like will only be used by the host if we schedule() or return to userspace.
Therefore, we avoid saving them if we handle the exit within the kernel, and if a reschedule is not requested. Based on a patch from Eddie Dong with a couple of fixes by me. Signed-off-by: Yaozu(Eddie) Dong Signed-off-by: Avi Kivity commit 2d8d6944a2249f642420bbc70b199182c70ebc9a Author: Avi Kivity Date: Mon Apr 30 14:47:02 2007 +0300 KVM: Assume that writes smaller than 4 bytes are to non-pagetable pages This allows us to remove write protection earlier than otherwise. Should some mad OS choose to use byte writes to update pagetables, it will suffer a performance hit, but still work correctly. Signed-off-by: Avi Kivity commit 7d0e7eed6200c54462e884abc8dd6681df2f5e7d Author: Avi Kivity Date: Mon Apr 30 12:42:43 2007 +0300 KVM: Fix RMW mmio handling Commit 9bf671a47ed6af3164524a31dbef9360f1b66fb5 optimized the mmio read path by returning to the emulator directly after an mmio read request. But we may also need to return to userspace in case the instruction was a read-modify-write instruction, which means we need to issue a write after completion of the read instead of returning to the guest. Signed-off-by: Avi Kivity commit f05f41f9bb1cf72a13caf61c2931dbbf4bff51eb Author: Anthony Liguori Date: Mon Apr 30 09:48:11 2007 +0300 KVM: SVM: Allow direct guest access to PC debug port The PC debug port is used for IO delay and does not require emulation. Signed-off-by: Anthony Liguori Signed-off-by: Avi Kivity commit 99c7b51d71c0b0062b752c5f0a4b3498d3d165db Author: He, Qing Date: Mon Apr 30 09:45:24 2007 +0300 KVM: VMX: Enable io bitmaps to avoid IO port 0x80 VMEXITs This patch enables IO bitmap control on vmx and unmasks the 0x80 port to avoid VMEXITs caused by accessing port 0x80. 0x80 is used for delays (see include/asm/io.h), and handling VMEXITs on its access is unnecessary and slows things down. This patch improves the kernel build test by around 3%-5%. Because every VM uses the same io bitmap, it is shared between all VMs rather than being a per-VM data structure. Signed-off-by: Qing He Signed-off-by: Avi Kivity commit c06d7c14c006c5e2dcd2a7d84603b51e9e60d7a7 Author: Avi Kivity Date: Sun Apr 29 16:25:49 2007 +0300 KVM: Remove unused 'instruction_length' As we no longer emulate in userspace, this is meaningless. We don't compute it on SVM anyway. Signed-off-by: Avi Kivity commit 20426d1309353b3e2771f9c7f534e01ce7a019f2 Author: Avi Kivity Date: Sun Apr 29 15:02:17 2007 +0300 KVM: Don't require explicit indication of completion of mmio or pio It is illegal to return from a pio or mmio request without completing it, as mmio or pio is an atomic operation. Therefore, we can simplify the userspace interface by avoiding the completion indication. Signed-off-by: Avi Kivity commit 9bf671a47ed6af3164524a31dbef9360f1b66fb5 Author: Avi Kivity Date: Wed Mar 14 15:54:54 2007 +0200 KVM: Remove extraneous guest entry on mmio read When emulating an mmio read, we actually emulate twice: once to determine the physical address of the mmio, and, after we've exited to userspace to get the mmio value, we emulate again to place the value in the result register and update any flags. But we don't really need to enter the guest again for that, only to take an immediate vmexit. So, if we detect that we're doing an mmio read, emulate a single instruction before entering the guest again.
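The io-bitmap commit above ("Enable io bitmaps to avoid IO port 0x80 VMEXITs") boils down to a few lines of setup; a sketch, with the real patch additionally allocating the bitmap pages and wiring their addresses into the VMCS:

    /* One bit per I/O port; a set bit means accesses trap to kvm. */
    static void kvm_init_io_bitmap(unsigned long *io_bitmap)
    {
        memset(io_bitmap, 0xff, PAGE_SIZE);  /* trap ports 0-0x7fff */
        clear_bit(0x80, io_bitmap);          /* io-delay port runs free */
    }

Since the bitmap contents are identical for every guest, one shared copy serves all VMs, as the commit message notes.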
Signed-off-by: Avi Kivity commit 8dfdb0d81fb9e858c14e03fd5e007b20167cd065 Author: Avi Kivity Date: Sun Apr 29 13:01:34 2007 +0300 KVM: Remove trailing whitespace Signed-off-by: Avi Kivity commit 1628bcc25417eae4c83ca87e0899c7e02961d975 Author: Anthony Liguori Date: Sun Apr 29 11:56:06 2007 +0300 KVM: SVM: Only save/restore MSRs when needed We only have to save/restore MSR_GS_BASE on every VMEXIT. The rest can be saved/restored when we leave the VCPU. Since we don't emulate the DEBUGCTL MSRs and the guest cannot write to them, we don't have to worry about saving/restoring them at all. This shaves a whopping 40% off raw vmexit costs on AMD. Signed-off-by: Anthony Liguori Signed-off-by: Avi Kivity commit 68ba823bbe6d546e3ceb63d006c62a84e92837db Author: Adrian Bunk Date: Sat Apr 28 21:20:48 2007 +0200 KVM: fix an if() condition It might have worked in this case since PT_PRESENT_MASK is 1, but let's express this correctly. Signed-off-by: Adrian Bunk Signed-off-by: Avi Kivity commit fe7dc1f2c0c3d0c21abf9dfa4387f0b748080688 Author: Anthony Liguori Date: Fri Apr 27 09:29:49 2007 +0300 KVM: VMX: Add lazy FPU support for VT Only save/restore the FPU host state when the guest is actually using the FPU. Signed-off-by: Anthony Liguori Signed-off-by: Avi Kivity commit 4a579478e5259df8828a8b9e5b3ddac2a946ce88 Author: Anthony Liguori Date: Fri Apr 27 09:29:21 2007 +0300 KVM: VMX: Properly shadow the CR0 register in the vcpu struct Set all of the host mask bits for CR0 so that we can maintain a proper shadow of CR0. This exposes CR0.TS, paving the way for lazy fpu handling. Signed-off-by: Anthony Liguori Signed-off-by: Avi Kivity commit aad1187a6c0201701026cdb2f7f6eeb49b2af4a2 Author: Avi Kivity Date: Wed Apr 25 16:57:46 2007 +0300 KVM: Move need_resched() check to common code Pointed out by Anthony Liguori. Signed-off-by: Avi Kivity commit b08487bd204708241c9b71ebfc555e334a4e4711 Author: Eddie Dong Date: Wed Apr 25 16:49:19 2007 +0300 KVM: VMX: Avoid unnecessary vcpu_load()/vcpu_put() cycles By checking if a reschedule is needed, we avoid dropping the vcpu. Signed-off-by: Avi Kivity commit 25900fd20d141145348178ffe91948e47c83e2ab Author: Avi Kivity Date: Wed Apr 25 11:51:06 2007 +0300 KVM: Avoid unused function warning due to assertion removal Signed-off-by: Avi Kivity commit 2bd9b992631841b1be5883a5c27b9c58ae9bb96a Author: Avi Kivity Date: Wed Apr 25 11:48:45 2007 +0300 KVM: We want asserts on debug builds, not release Noticed by Michael Riepe. Signed-off-by: Avi Kivity commit c3efc3ab86aa651106f6302592e25c7ab8285c35 Author: Avi Kivity Date: Thu Apr 12 13:03:01 2007 +0300 KVM: Initialize cr0 to indicate an fpu is present Solaris panics if it sees a cpu with no fpu, and it seems to rely on this bit. Closes sourceforge bug 1698920. Signed-off-by: Avi Kivity commit 28b183145d34a8ad1bc462df565165a88bcb5220 Author: Yaozu Dong Date: Wed Apr 25 14:17:25 2007 +0800 KVM: MMU: Avoid heavy ASSERT in non-debug mode. Signed-off-by: Avi Kivity commit 418987aef13b475140b76f9f780046d63eb16f86 Author: Avi Kivity Date: Wed Apr 25 11:01:28 2007 +0300 KVM: Document MSR_K6_STAR's special place in the msr index array Signed-off-by: Avi Kivity commit 90ca9e3d54c8b0ac2023c624d1c7260bb8926beb Author: Avi Kivity Date: Wed Apr 25 10:59:52 2007 +0300 KVM: Don't complain about cpu erratum AA15 It slows down Windows x64 horribly.
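The lazy-FPU work above (VT here, with the SVM counterpart following) hinges on trapping #NM once and then getting out of the way; a sketch under those assumptions (fx_save()/fx_restore() and the vcpu fields approximate that era's helpers and are not guaranteed to match the patch):

    /* #NM handler path: first guest fpu use since this vcpu was
     * scheduled in. */
    static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
    {
        if (vcpu->fpu_active)
            return;
        vcpu->fpu_active = 1;
        /* stop trapping fpu use: clear CR0.TS for the guest */
        vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
        fx_save(vcpu->host_fx_image);       /* stash host fpu state */
        fx_restore(vcpu->guest_fx_image);   /* load guest fpu state */
    }

The CR0-shadowing patch above is what makes this workable: with CR0.TS host-owned, the guest's reads of CR0 still see its own TS value.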
Signed-off-by: Avi Kivity commit 6f19cb49965e1316b285a443c9392031b1634f2e Author: Avi Kivity Date: Tue Apr 24 14:13:01 2007 +0300 KVM: Fix msr-avoidance regression on Core processors Core processors don't have the STAR msr, so the attempt not to save it caused an underflow in the number of msrs. Fix by only avoiding the STAR msr if it is actually present. Signed-off-by: Avi Kivity commit ccf9e2f22e5caf6274b5e9aafd9814a32ef049d5 Author: Anthony Liguori Date: Mon Apr 23 09:17:21 2007 -0500 KVM: Lazy FPU support for SVM Avoid saving and restoring the guest fpu state on every exit. This shaves ~100 cycles off the guest/host switch. Signed-off-by: Anthony Liguori Signed-off-by: Avi Kivity commit d558e0b49319cfc9aa92e9b7215580f265a2ead7 Author: Avi Kivity Date: Sun Apr 22 15:28:19 2007 +0300 KVM: Allow passing 64-bit values to the emulated read/write API This simplifies the API somewhat (by eliminating the special-case cmpxchg8b on i386). Signed-off-by: Avi Kivity commit 551284356a39f20de70cd5556e85ae92080aec8c Author: Avi Kivity Date: Fri Apr 20 13:41:09 2007 +0300 KVM: Silence compile warning on i386 Signed-off-by: Avi Kivity commit 459377fe9ba4a307144ead3ad86993cdee9f8fe8 Author: Avi Kivity Date: Thu Apr 19 17:27:43 2007 +0300 KVM: Per-vcpu statistics Make the exit statistics per-vcpu instead of global. This gives a 3.5% boost when running one virtual machine per core on my two socket dual core (4 cores total) machine. Signed-off-by: Avi Kivity commit 5c828f83928f186320d74627089122ebc9ea98ce Author: Avi Kivity Date: Thu Apr 19 14:28:44 2007 +0300 KVM: VMX: Only save/restore MSR_K6_STAR if necessary Intel hosts only support syscall/sysret in long mode (and only if efer.sce is enabled), so only reload the related MSR_K6_STAR if the guest will actually be able to use it. This reduces vmexit cost by about 500 cycles (6400 -> 5870) on my setup. Signed-off-by: Avi Kivity commit 37d6247b3636cbf47014694483d2d25c3806e8f2 Author: Avi Kivity Date: Thu Apr 19 13:26:39 2007 +0300 KVM: Fold drivers/kvm/kvm_vmx.h into drivers/kvm/vmx.c No meat in that file. Signed-off-by: Avi Kivity commit ba9c2fc1015a2b2f1f930274d465662ed8b860e6 Author: Avi Kivity Date: Thu Apr 19 13:22:48 2007 +0300 KVM: VMX: Don't switch 64-bit msrs for 32-bit guests Some msrs are only used by x86_64 instructions, and are therefore not needed when the guest is in legacy mode. By not bothering to switch them, we reduce vmexit latency by 2400 cycles (from about 8800) when running a 32-bit guest on a 64-bit host. Signed-off-by: Avi Kivity commit 8d6c8a0d891f8c37889f28f368c2621f85e50035 Author: Avi Kivity Date: Wed Apr 18 11:18:18 2007 +0300 KVM: Fix off-by-one when writing to a nonpae guest pde Nonpae guest pdes are shadowed by two pae pdes, so we double the offset twice: once to account for the pte size difference, and once because we need two shadow pdes for a single guest pde. But when writing to the upper guest pde we also need to truncate the lower bits, otherwise the multiply shifts these bits into the pde index and causes an access to the wrong shadow pde. If we're at the end of the page (accessing the very last guest pde) we can even overflow into the next host page and oops. Signed-off-by: Avi Kivity commit f0b9c908fa1451147a07f2f4e4a9409fb7b14160 Author: Avi Kivity Date: Tue Apr 17 15:30:24 2007 +0300 KVM: VMX: Reduce unnecessary saving of host msrs The automatically switched msrs are never changed on the host (with the exception of MSR_KERNEL_GS_BASE) and thus there is no need to save them on every vm entry.
This reduces vmexit latency by ~400 cycles on i386 and by ~900 cycles (10%) on x86_64. Signed-off-by: Avi Kivity commit 7368e6550cdf72b0ad1b68dbe923f85e37ef4d08 Author: Avi Kivity Date: Tue Apr 17 10:53:22 2007 +0300 KVM: Handle guest page faults when emulating mmio Usually, guest page faults are detected by the kvm page fault handler, which detects if they are shadow faults, mmio faults, pagetable faults, or normal guest page faults. However, in certain circumstances, we can detect a page fault much later. One of these events is the following combination: - A two memory operand instruction (e.g. movsb) is executed. - The first operand is in mmio space (which is the fault reported to kvm) - The second operand is at an unmapped address (e.g. a guest page fault) The Windows 2000 installer does such an access, and promptly hangs. Fix by adding the missing page fault injection on that path. Signed-off-by: Avi Kivity commit 894f5a5efc0c48482eb10ad48891054a659e5941 Author: Avi Kivity Date: Mon Apr 16 14:28:40 2007 +0300 KVM: SVM: Report hardware exit reason to userspace instead of dmesg Signed-off-by: Avi Kivity commit 94d806a6efd4401ce43358af6a9e8df5a63151ae Author: Avi Kivity Date: Mon Apr 16 13:36:10 2007 +0300 KVM: Fix pio completion Check cur_count instead of count to avoid false completions. Signed-off-by: Avi Kivity commit d3344ae6f6293913d6e4f230ebee0b370f2e3f98 Author: Avi Kivity Date: Mon Apr 16 11:53:17 2007 +0300 KVM: Retry sleeping allocation if atomic allocation fails This avoids -ENOMEM under memory pressure. Signed-off-by: Avi Kivity commit 327585c3b4c1d6b04bb752f70f350d98ca855080 Author: Avi Kivity Date: Sun Apr 15 16:31:09 2007 +0300 KVM: Use slab caches to allocate mmu data structures Better leak detection, statistics, memory use, speed -- goodness all around. Signed-off-by: Avi Kivity commit 3079541923d2cdf702490eff7081610b7320e37f Author: Avi Kivity Date: Sun Apr 15 15:48:11 2007 +0300 KVM: Fix string pio when count == 0 Surprisingly, VT traps when executing a string pio instruction with zero count. Perhaps more surprisingly, the Windows ne2000 driver issues such instructions. Since we aren't prepared to handle completions of these instructions, avoid the entire mess by continuing execution without escaping to userspace. This fixes the networking problems reported by Leslie Mann with recent versions of kvm. Signed-off-by: Avi Kivity commit 3ef1110c81993e01343e1b473f5d7d1a23e6a8a3 Author: Avi Kivity Date: Thu Apr 12 17:35:58 2007 +0300 KVM: Handle partial pae pdptr Some guests (Solaris) do not set up all four pdptrs, but leave some invalid. kvm incorrectly treated these as valid page directories, pinning the wrong pages and causing general confusion. Fix by checking the valid bit of a pae pdpte. This closes sourceforge bug 1698922. Signed-off-by: Avi Kivity commit 4e9d9d330d9c9e66c449be10950562e407366a73 Author: Avi Kivity Date: Wed Apr 11 19:04:39 2007 +0300 KVM: Fix memory leak on pio completion We get_page() the pages participating in pio before we return to userspace, yet we neglect to free them. One can leak all guest memory in a few seconds by doing hdparm -d 0 /dev/hda; dd < /dev/hda > /dev/null on the guest. Signed-off-by: Avi Kivity commit b630b9c6819844e29cddcfeaee901f6ada5d571b Author: Eric Sesterhenn / Snakebyte Date: Mon Apr 9 16:15:05 2007 +0200 KVM: Fix overflow bug in overflow detection code The expression sp - 6 < sp, where sp is a u16, is undefined in C since 'sp - 6' is promoted to int, and signed overflow is undefined in C. gcc 4.2 actually warns about it.
Replace with a simpler test. Signed-off-by: Eric Sesterhenn Signed-off-by: Avi Kivity commit c338c271f150ab2ded369ef4c1882f85b28af709 Author: Avi Kivity Date: Mon Apr 2 13:05:50 2007 +0300 KVM: Use kernel-standard types Noted by Joerg Roedel. Signed-off-by: Avi Kivity commit 0ea6eecef44923d66409a49d71e4fa87fa0f5bed Author: Avi Kivity Date: Sun Apr 1 16:34:31 2007 +0300 KVM: Add fpu get/set operations These are really helpful when migrating a floating point app to another machine. Signed-off-by: Avi Kivity commit 05671a064c73b8cb8966ddd037ece2d6ae2cb75b Author: Avi Kivity Date: Fri Mar 30 16:54:30 2007 +0300 KVM: Add physical memory aliasing feature With this, we can specify that accesses to one physical memory range will be remapped to another. This is useful for the vga window at 0xa0000 which is used as a movable window into the (much larger) framebuffer. Signed-off-by: Avi Kivity commit 8e08039818b6a5b8c81b905f863adaa18d774171 Author: Avi Kivity Date: Fri Mar 30 14:02:32 2007 +0300 KVM: Simplify gfn_to_page() Mapping a guest page to a host page is a common operation. Currently, one first has to find the memory slot where the page belongs (gfn_to_memslot), then locate the page itself (gfn_to_page()). This is clumsy, and also won't work well with memory aliases. So simplify gfn_to_page() not to require memory slot translation first, and instead do it internally. Signed-off-by: Avi Kivity commit 66a9932c55ff7240955d57b7d1e62178a9e80868 Author: Dor Laor Date: Fri Mar 30 13:06:33 2007 +0300 Add mmu cache clear function Functions that play around with the physical memory map need a way to clear mappings to possibly nonexistent or invalid memory. Both the mmu cache and the processor tlb are cleared. Signed-off-by: Dor Laor Signed-off-by: Avi Kivity commit 6095d7b8291fc3e05f3b8790a9bc86b54af281a2 Author: Joerg Roedel Date: Fri Mar 30 17:02:14 2007 +0300 KVM: SVM: enable LBRV virtualization if available This patch enables the virtualization of the last branch record MSRs on SVM if this feature is available in hardware. It also introduces a small and simple check feature for specific SVM extensions. Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit 8f1469e8477bea483d5a6348a30a534449048c8d Author: Avi Kivity Date: Wed Mar 28 20:04:16 2007 +0200 KVM: x86 emulator: fix bit string operations operand size On x86, bit operations operate on a string of bits that can reside in multiple words. For example, 'btsl %eax, (blah)' will touch the word at blah+4 if %eax is between 32 and 63. The x86 emulator compensates for that by advancing the operand address by (bit offset / BITS_PER_LONG) and truncating the bit offset to the range (0..BITS_PER_LONG-1). This has a side effect of forcing the operand size to 8 bytes on 64-bit hosts. Now, a 32-bit guest goes and fork()s a process. It write protects a stack page at 0xbffff000 using the 'btr' instruction, at offset 0xffc in the page table, with bit offset 1 (for the write permission bit). The emulator now forces the operand size to 8 bytes as previously described, and an innocent page table update turns into a cross-page-boundary write, which is assumed by the mmu code not to be a page table, so it doesn't actually clear the corresponding shadow page table entry. The guest and host permissions are out of sync and guest memory is corrupted soon afterwards, leading to guest failure. Fix by not using BITS_PER_LONG as the word size; instead use the actual operand size, so we get a 32-bit write in that case.
Note we still have to teach the mmu to handle cross-page-boundary writes to guest page tables; but for now this allows Damn Small Linux 0.4 (2.4.20) to boot. Signed-off-by: Avi Kivity commit e3a065c4e99bb8282d72a2c3c75234d7d7408be6 Author: Avi Kivity Date: Tue Mar 27 17:50:20 2007 +0200 KVM: Remove debug message No longer interesting. Signed-off-by: Avi Kivity commit 19cd40d605bb99fc9058973a69ef208c8b5b1e42 Author: Avi Kivity Date: Tue Mar 27 16:12:41 2007 +0200 Revert "added KVM_GET_MEM_MAP ioctl to get the memory bitmap for a memory slot" This reverts commit ade11a015f83d270d1201c440199146f852fe5e4. As the balloon path will be through qemu, it will have direct knowledge of released gfns, so this API is not directly needed. If it becomes useful in the future, it will be un-reverted. Signed-off-by: Avi Kivity commit 932bf20c0c2075f958bb86b481d8f359197b4d6a Author: Avi Kivity Date: Mon Mar 26 19:31:52 2007 +0200 KVM: Use list_move() Use list_move() where possible. Noticed by Dor Laor. Signed-off-by: Avi Kivity commit 31e82571e8a77d5feb1093627ef0b31f28649590 Author: Michal Piotrowski Date: Sun Mar 25 17:59:32 2007 +0200 KVM: Remove unused function Remove unused function CC drivers/kvm/svm.o drivers/kvm/svm.c:207: warning: ‘inject_db’ defined but not used Signed-off-by: Michal Piotrowski Signed-off-by: Avi Kivity commit 9207113c121519986a114ee5c498184e618ffd68 Author: Avi Kivity Date: Sun Mar 25 12:07:27 2007 +0200 KVM: SVM: Ensure timestamp counter monotonicity When a vcpu is migrated from one cpu to another, its timestamp counter may lose its monotonic property if the host has unsynced timestamp counters. This can confuse the guest, sometimes to the point of refusing to boot. As the rdtsc instruction is rather fast on AMD processors (7-10 cycles), we can simply record the last host tsc when we drop the cpu, and adjust the vcpu tsc offset when we detect that we've migrated to a different cpu. Signed-off-by: Avi Kivity commit b40faf227eb371a52aa21d08f8e9c33fc06602b4 Author: Avi Kivity Date: Fri Mar 23 09:55:25 2007 +0200 KVM: MMU: Fix hugepage pdes mapping same physical address with different access The kvm mmu keeps a shadow page for hugepage pdes; if several such pdes map the same physical address, they share the same shadow page. This is a fairly common case (kernel mappings on i386 nonpae Linux, for example). However, if the two pdes map the same memory but with different permissions, kvm will happily use the cached shadow page. If the access through the more permissive pde occurs after the access through the strict pde, an endless pagefault loop is generated and the guest makes no progress. Fix by making the access permissions part of the cache lookup key. The fix allows Xen pae to boot on kvm and run guest domains. Thanks to Jeremy Fitzhardinge for reporting the bug and testing the fix. Signed-off-by: Avi Kivity commit 061bba1190514205594d2046f5dc31a01a135163 Author: Avi Kivity Date: Thu Mar 22 15:10:32 2007 +0200 Revert "KVM: Remove extraneous guest entry on mmio read" This reverts commit b0092d187cfa19dfcada3b85d728af5ae27989dc. While the optimization is sound, it regresses booting the Fedora Core 6 32 bit kernel. Signed-off-by: Avi Kivity commit 4cec1674d1436157c7dcc2b5b6f625b08b2b96e8 Author: Joerg Roedel Date: Wed Mar 21 19:47:00 2007 +0100 KVM: SVM: forbid guest to execute monitor/mwait This patch forbids the guest from executing monitor/mwait instructions on SVM.
This is necessary because the guest can execute these instructions if they are available even if the kvm cpuid doesn't report their existence. Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit 7921ad9e303f3f03dd81b552e3b0cd87ef355219 Author: Sergey Kiselev Date: Thu Mar 22 14:06:18 2007 +0200 KVM: Handle writes to MCG_STATUS msr Some older (~2.6.7) kernels write the MCG_STATUS register during kernel boot (mce_clear_all() function, called from mce_init()). It's not currently handled by kvm and will cause it to inject a GPF. The following patch adds a "nop" handler for this. Signed-off-by: Sergey Kiselev Signed-off-by: Avi Kivity commit 36809e1326c13887d324025d4592958ead8758d5 Author: Avi Kivity Date: Wed Mar 21 18:14:42 2007 +0200 KVM: Remove unused and write-only variables Trivial cleanup. Signed-off-by: Avi Kivity commit 262e17b818054dad314a062a439681d79a336d48 Author: Avi Kivity Date: Wed Mar 21 18:11:36 2007 +0200 KVM: Don't allow the guest to turn off the cpu cache The cpu cache is a host resource; the guest should not be able to turn it off (even for itself). Signed-off-by: Avi Kivity commit 8c37a70d93ba3e4286ad7524f7915a32ed39cac9 Author: Avi Kivity Date: Wed Mar 21 17:58:32 2007 +0200 KVM: Hack real-mode segments on vmx from KVM_SET_SREGS As usual, we need to mangle segment registers when emulating real mode as vm86 has specific constraints. We special case the reset segment base, and set the "access rights" (or descriptor flags) to vm86 compatible values. This fixes reboot on vmx. Signed-off-by: Avi Kivity commit 0bf8d346418255335dc9062d96b9f8814b471690 Author: Avi Kivity Date: Wed Mar 21 13:44:58 2007 +0200 KVM: Modify guest segments after potentially switching modes The SET_SREGS ioctl modifies both cr0.pe (real mode/protected mode) and guest segment registers. Since segment handling is modified by the mode on Intel processors, update the segment registers after the mode switch has taken place. Signed-off-by: Avi Kivity commit f97af70b3aa8a92ddeabb7d42477e7d13dd0a192 Author: Avi Kivity Date: Tue Mar 20 18:44:51 2007 +0200 KVM: Remove set_cr0_no_modeswitch() arch op set_cr0_no_modeswitch() was a hack to avoid corrupting segment registers. As we now cache the protected mode values on entry to real mode, this isn't an issue anymore, and it interferes with reboot (which usually _is_ a modeswitch). Signed-off-by: Avi Kivity commit e314dde30e3851e8effc017c6fffced11d90183a Author: Avi Kivity Date: Tue Mar 20 18:40:40 2007 +0200 KVM: Workaround vmx inability to virtualize the reset state The reset state has cs.selector == 0xf000 and cs.base == 0xffff0000, which aren't compatible with vm86 mode, which is used for real mode virtualization. When we create a vcpu, we set cs.base to 0xf0000, but if we get there by way of a reset, the values are inconsistent and vmx refuses to enter guest mode. Workaround by detecting the state and munging it appropriately. Signed-off-by: Avi Kivity commit 88aea7ddfae755633b0a80ccfa56244b3c79c7b0 Author: Avi Kivity Date: Tue Mar 20 14:34:28 2007 +0200 KVM: MMU: Remove global pte tracking The initial, noncaching, version of the kvm mmu flushed all nonglobal shadow page table translations (much like a native tlb flush). The new implementation flushes translations only when they change, rendering global pte tracking superfluous. This removes the unused tracking mechanism and storage space.
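The real-mode segment mangling described in the "Hack real-mode segments" commit above has a simple core: force each segment into a shape vm86 will accept. A sketch of what "vm86 compatible" means for one segment (field names from struct kvm_segment; the exact values here are assumptions, not the committed ones):

    static void force_vm86_compatible(struct kvm_segment *s)
    {
        s->base = (u32)s->selector << 4;  /* the real-mode rule */
        s->limit = 0xffff;
        s->type = 3;        /* read/write data segment */
        s->s = 1;           /* code/data, not a system segment */
        s->present = 1;
        s->dpl = 3;         /* vm86 code runs at privilege level 3 */
        s->db = 0;          /* 16-bit */
        s->g = 0;           /* byte granularity */
    }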
Signed-off-by: Avi Kivity commit 66e5d5c81b5b89e39aa86e3bf9864d228f468b0d Author: Avi Kivity Date: Tue Mar 20 14:29:06 2007 +0200 KVM: MMU: Remove unnecessary check for pdptr access We already special case the pdptr access, so no need to check it again. Signed-off-by: Avi Kivity commit c01571ed56754dfea458cc37d553c360082411a1 Author: Avi Kivity Date: Tue Mar 20 12:46:50 2007 +0200 KVM: Avoid guest virtual addresses in string pio userspace interface The current string pio interface communicates using guest virtual addresses, relying on userspace to translate addresses and to check permissions. This interface cannot fully support guest smp, as the check needs to take into account two pages at once in case an unaligned string transfer straddles a page boundary. Change the interface not to communicate guest addresses at all; instead use a buffer page (mmapped by userspace) and do transfers there. The kernel manages the virtual to physical translation and can perform the checks atomically by taking the appropriate locks. Signed-off-by: Avi Kivity commit 74c24de6e7848a45d6109d987d4fd2ccd83e432e Author: Avi Kivity Date: Wed Mar 7 13:11:17 2007 +0200 KVM: Future-proof argument-less ioctls Some ioctls ignore their arguments. By requiring them to be zero now, we allow a nonzero value to have some special meaning in the future. Signed-off-by: Avi Kivity commit 29e686a1dc9631b7898d087a0ab1c4716672e209 Author: Avi Kivity Date: Wed Mar 7 13:05:38 2007 +0200 KVM: Allow kernel to select size of mmap() buffer This allows us to store offsets in the kernel/user kvm_run area, and be sure that userspace has them mapped. As offsets can be outside the kvm_run struct, userspace has no way of knowing how much to mmap. Signed-off-by: Avi Kivity commit cce3a1062817218c67163732339e2ea25e9f023b Author: Avi Kivity Date: Mon Mar 5 19:46:05 2007 +0200 KVM: Add guest mode signal mask Allow a special signal mask to be used while executing in guest mode. This allows signals to be used to interrupt a vcpu without requiring signal delivery to a userspace handler, which is quite expensive. Userspace still receives -EINTR and can get the signal via sigwait(). Signed-off-by: Avi Kivity commit cd3aaa2392baec9674792d71d304ec41e540b517 Author: Avi Kivity Date: Mon Mar 5 17:45:40 2007 +0200 KVM: Initialize the apic_base msr on svm too Older userspace didn't care, but newer userspace (with the cpuid changes) does. Signed-off-by: Avi Kivity commit c303c0efc5b2ff8c0f77c9079fa66f62801da93d Author: Avi Kivity Date: Sun Mar 4 14:24:03 2007 +0200 KVM: Add a special exit reason when exiting due to an interrupt This is redundant, as we also return -EINTR from the ioctl, but it allows us to examine the exit_reason field on resume without seeing old data. Signed-off-by: Avi Kivity commit 62919332e00e3226dd1f728ff83107d06a6d9a81 Author: Avi Kivity Date: Sun Mar 4 14:17:08 2007 +0200 KVM: Fold kvm_run::exit_type into kvm_run::exit_reason Currently, userspace is told about the nature of the last exit from the guest using two fields, exit_type and exit_reason, where exit_type has just two enumerations (and no need for more). So fold exit_type into exit_reason, reducing the complexity of determining what really happened. Signed-off-by: Avi Kivity commit 9e16898f4f5d6cdc35030bb272631611b71548fe Author: Avi Kivity Date: Sun Mar 4 13:59:30 2007 +0200 KVM: Allow userspace to process hypercalls which have no kernel handler This is useful for paravirtualized graphics devices, for example.
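From userspace, the guest-mode signal mask above is used roughly as follows (a sketch: KVM_SET_SIGNAL_MASK is the name this ioctl took in the exported API, and the fd and signal choice are illustrative):

    /* Keep SIGUSR1 blocked in the vcpu thread except while KVM_RUN
     * executes, so it interrupts the guest without running a handler. */
    static int set_vcpu_sigmask(int vcpu_fd)
    {
        sigset_t run_blocked;     /* blocked set while in guest mode */
        struct kvm_signal_mask *mask;

        sigfillset(&run_blocked);
        sigdelset(&run_blocked, SIGUSR1);  /* deliverable in guest mode */
        mask = malloc(sizeof(*mask) + sizeof(run_blocked));
        if (!mask)
            return -1;
        mask->len = sizeof(run_blocked);
        memcpy(mask->sigset, &run_blocked, sizeof(run_blocked));
        return ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, mask);
    }

KVM_RUN then returns -EINTR when the signal arrives, and the thread collects it with sigwait() instead of a handler.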
Signed-off-by: Avi Kivity commit 440fd9098bceb2ca0856d962ff62db9af4d1094a Author: Avi Kivity Date: Thu Mar 1 17:56:20 2007 +0200 KVM: Add method to check for backwards-compatible API extensions Signed-off-by: Avi Kivity commit 0b37dedb178bcb3b0a28f65e6ae835bf58184301 Author: Avi Kivity Date: Thu Mar 1 17:20:13 2007 +0200 KVM: Renumber ioctls The recent changes have left the ioctl numbers in complete disarray. Signed-off-by: Avi Kivity commit 95cab16b18e1c1a786a9fc5ea6fcd68b29ae3481 Author: Avi Kivity Date: Thu Mar 1 16:47:06 2007 +0200 KVM: Remove minor wart from KVM_CREATE_VCPU ioctl That ioctl does not transfer any data, so it should be an _IO rather than an _IOW. Signed-off-by: Avi Kivity commit ba5cb15b027b76ba7b4d247914eb6d20065c0767 Author: Avi Kivity Date: Thu Mar 1 16:20:40 2007 +0200 KVM: Remove the 'emulated' field from the userspace interface We no longer emulate single instructions in userspace. Instead, we service mmio or pio requests. Signed-off-by: Avi Kivity commit 706e8fe655be36aa686f1fbb398d3a4470d4939b Author: Avi Kivity Date: Wed Feb 28 20:46:53 2007 +0200 KVM: Handle cpuid in the kernel instead of punting to userspace KVM used to handle cpuid by letting userspace decide what values to return to the guest. We now handle cpuid completely in the kernel. We still let userspace decide which values the guest will see by having userspace set up the value table beforehand (this is necessary to allow management software to set the cpu features to the least common denominator, so that live migration can work). The motivation for the change is that kvm kernel code can be impacted by cpuid features, for example the x86 emulator. Signed-off-by: Avi Kivity commit aad2f6e0faf4b03e087bbe6751acdacd72e911b6 Author: Avi Kivity Date: Thu Feb 22 19:48:43 2007 +0200 KVM: Initialize PIO I/O count This allows userspace to ignore the io.rep field. Not a big deal, but friendly. Signed-off-by: Avi Kivity commit e668cf946ee8654c7f5afe3feeed686a3566c22a Author: Avi Kivity Date: Thu Feb 22 19:39:30 2007 +0200 KVM: Do not communicate to userspace through cpu registers during PIO Currently, when passing a PIO emulation request to userspace, we rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx (on string instructions). This (a) requires two extra ioctls for getting and setting the registers and (b) is unfriendly to non-x86 archs, when they get kvm ports. So fix by doing the register fixups in the kernel and passing to userspace only an abstract description of the PIO to be done. Signed-off-by: Avi Kivity commit 3de857cd1335bd2e02b60d3a50b7da93ccbabf1d Author: Avi Kivity Date: Thu Feb 22 12:58:31 2007 +0200 KVM: Use a shared page for kernel/user communication when running a vcpu Instead of passing a 'struct kvm_run' back and forth between the kernel and userspace, allocate a page and allow the user to mmap() it. This reduces needless copying and makes the interface expandable by providing lots of free space. Signed-off-by: Avi Kivity commit 128e159e11e999496ec44a549fcac91de3802389 Author: Avi Kivity Date: Mon Mar 19 13:18:10 2007 +0200 KVM: Prevent system selectors leaking into guest on real->protected mode transition on vmx Intel virtualization extensions do not support virtualizing real mode. So kvm uses virtualized vm86 mode to run real mode code. Unfortunately, this virtualized vm86 mode does not support the so called "big real" mode, where the segment selector and base do not agree with each other according to the real mode rules (base == selector << 4).
To work around this, kvm checks whether a selector/base pair violates the virtualized vm86 rules, and if so, forces it into conformance. On a transition back to protected mode, if we see that the guest did not touch a forced segment, we restore it back to the original protected mode value. This pile of hacks breaks down if the gdt has changed in real mode, as it can cause a segment selector to point to a system descriptor instead of a normal data segment. In fact, this happens with the Windows bootloader and the qemu acpi bios, where a protected mode memcpy routine issues an innocent 'pop %es' and traps on an attempt to load a system descriptor. "Fix" by checking if the to-be-restored selector points at a system segment, and if so, coercing it into a normal data segment. The long term solution, of course, is to abandon vm86 mode and use emulation for big real mode. Signed-off-by: Avi Kivity commit ade11a015f83d270d1201c440199146f852fe5e4 Author: Uri Lublin Date: Wed Mar 14 19:21:06 2007 +0200 added KVM_GET_MEM_MAP ioctl to get the memory bitmap for a memory slot To be used when there may be "holes" in the memory. Specifically, to not break VM migration when a ballooning mechanism exists. Signed-off-by: Uri Lublin commit b0092d187cfa19dfcada3b85d728af5ae27989dc Author: Avi Kivity Date: Wed Mar 14 15:54:54 2007 +0200 KVM: Remove extraneous guest entry on mmio read When emulating an mmio read, we actually emulate twice: once to determine the physical address of the mmio, and, after we've exited to userspace to get the mmio value, we emulate again to place the value in the result register and update any flags. But we don't really need to enter the guest again for that, only to take an immediate vmexit. So, if we detect that we're doing an mmio read, emulate a single instruction before entering the guest again. Signed-off-by: Avi Kivity commit 470db88b8b3491199e8d55b771d66e74b2fd53cd Author: Ingo Molnar Date: Sun Mar 11 13:52:33 2007 +0100 KVM: always reload segment selectors A failed VM entry on VMX might still change %fs or %gs, so make sure that KVM always reloads the segment selectors. This is crucial on both x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like 'current' depend and x86_64 has 0 there and needs MSR_GS_BASE to work. Signed-off-by: Ingo Molnar commit f7edc6a39584a3f95687a5320675fadb23bccbe5 Author: Ingo Molnar Date: Sat Mar 10 11:22:51 2007 +0100 KVM: trivial whitespace fixes trivial whitespace fixes. Signed-off-by: Ingo Molnar commit f3a33bfeaa5cade1a9ac1facb5cb904a483b1e5c Author: Avi Kivity Date: Fri Mar 9 13:04:31 2007 +0200 KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram PAGE_MASK is an unsigned long, so using it to mask physical addresses on i386 (which are 64-bit wide) leads to truncation. This can result in page->private of unrelated memory pages being modified, with disastrous results. Fix by not using PAGE_MASK for physical addresses; instead calculate the correct value directly from PAGE_SIZE. Also fix a similar BUG_ON(). Signed-off-by: Avi Kivity commit 6ee9853b015f8807f497ffad39b142ddc1403aa9 Author: Avi Kivity Date: Thu Mar 8 17:13:32 2007 +0200 KVM: MMU: Fix guest writes to nonpae pde KVM shadow page tables are always in pae mode, regardless of the guest setting. This means that a guest pde (mapping 4MB of memory) is mapped to two shadow pdes (mapping 2MB each). When the guest writes to a pte or pde, we intercept the write and emulate it. We also remove any shadowed mappings corresponding to the write.
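Ingo Molnar's selector-reload patch above addresses state that a failed VM entry can silently clobber; the x86_64-flavored core of the idea, sketched with an illustrative helper name:

    static void reload_host_selectors(u16 fs_sel, u16 gs_sel,
                                      unsigned long gs_base)
    {
        loadsegment(fs, fs_sel);      /* i386 keeps the pda in %fs */
        load_gs_index(gs_sel);        /* writing %gs clears its base... */
        wrmsrl(MSR_GS_BASE, gs_base); /* ...so MSR_GS_BASE must follow */
    }

Reloading unconditionally (instead of only when a change is detected) is the point of the patch: after a failed entry the cached values can no longer be trusted.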
Since the mmu did not account for the doubling in the number of pdes, it removed the wrong entry, resulting in a mismatch between shadow page tables and guest page tables, followed shortly by guest memory corruption. This patch fixes the problem by detecting the special case of writing to a non-pae pde and adjusting the address and number of shadow pdes zapped accordingly. Signed-off-by: Avi Kivity commit 374c1509c7d04a4e351b1812c2f0b9dac3ea0c0a Author: Avi Kivity Date: Thu Mar 8 11:48:09 2007 +0200 KVM: Fix bogus sign extension in mmu mapping audit When auditing a 32-bit guest on a 64-bit host, sign extension of the page table directory pointer table index caused bogus addresses to be shown on audit errors. Fix by declaring the index unsigned. Signed-off-by: Avi Kivity commit fac539542cbf923a39238b10557c88f99fd45b59 Author: Avi Kivity Date: Wed Mar 7 09:29:48 2007 +0200 KVM: Export This allows users to actually build programs that use kvm without the entire source tree. Signed-off-by: Avi Kivity commit c14a46343cc9f04f15ebc67573031fe8bbe1555a Author: Avi Kivity Date: Tue Mar 6 12:05:53 2007 +0200 KVM: Fix guest sysenter on vmx The vmx code currently treats the guest's sysenter support msrs as 32-bit values, which breaks 32-bit compat mode userspace on 64-bit guests. Fix by using the native word width of the machine. Signed-off-by: Avi Kivity commit ea135e7671189ffb7e67843bf98740dac0c6ccfa Author: Avi Kivity Date: Sun Mar 4 13:27:36 2007 +0200 KVM: Use own minor number Use the minor number (232) allocated to kvm by lanana. Signed-off-by: Avi Kivity commit 21af17507f37658414191b1cf1337efbaf7dd530 Author: Dor Laor Date: Mon Feb 19 18:25:43 2007 +0200 KVM: Use the generic skip_emulated_instruction() in hypercall code Instead of twiddling the rip registers directly, use the skip_emulated_instruction() function to do that for us. Signed-off-by: Dor Laor Signed-off-by: Avi Kivity commit 57d78025d84fb607aa335d015a79b257517aa209 Author: Dor Laor Date: Mon Feb 19 16:44:49 2007 +0200 KVM: Fix guest register corruption on paravirt hypercall The hypercall code mixes up the ->cache_regs() and ->decache_regs() callbacks, resulting in guest register corruption. Signed-off-by: Dor Laor Signed-off-by: Avi Kivity commit 28e9803c9134683a884efe05abdb3f814c1ca7e7 Author: Avi Kivity Date: Thu Mar 1 19:21:03 2007 +0200 KVM: Unset kvm_arch_ops if arch module loading failed Otherwise, the core module thinks the arch module is loaded, and won't let you reload it after you've fixed the bug. Signed-off-by: Avi Kivity commit 426bc2fd1462706ec92d0e9efdb0cf3643f4eb67 Author: Avi Kivity Date: Thu Mar 1 11:28:13 2007 +0200 KVM: Move kvmfs magic number to From: Andrew Morton Use the standard magic.h for kvmfs. Cc: Avi Kivity Signed-off-by: Andrew Morton Signed-off-by: Avi Kivity commit c1a8557e1da6e7d8bf8f77cb1b47c077f5c2a67d Author: Avi Kivity Date: Mon Feb 26 16:29:43 2007 +0200 KVM: Fix bogus failure in kvm.ko module initialization A bogus 'return r' can cause an otherwise successful module load to fail. This denies users the use of kvm, and it also denies them the use of their machine, as it leaves a filesystem registered with its callbacks pointing into now-freed module memory. Fix by returning a zero like a good module. Thanks to Richard Lucassen (?) for reporting the problem and for providing access to a machine which exhibited it. 
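The shape of that bug, as a hypothetical sketch with placeholder names (example_fs_type and examplefs are illustrative, not the actual kvm init code): once the filesystem is registered, the init routine must return 0 on the success path rather than whatever happens to be left in its error variable, or a "successful" load is rolled back while the filesystem still points at freed callbacks.

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

static struct file_system_type example_fs_type = {
	.owner = THIS_MODULE,
	.name = "examplefs",	/* placeholder name */
};

static int __init example_module_init(void)
{
	int r;

	r = register_filesystem(&example_fs_type);
	if (r)
		return r;
	/* ... later setup may leave a stale nonzero value in r ... */
	return 0;	/* must be 0 here, not 'return r' */
}
module_init(example_module_init);
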
Signed-off-by: Avi Kivity commit 7703ff91ee2ed171f2175d030e7f063c4efab2f5 Author: Uri Lublin Date: Thu Feb 22 17:37:32 2007 +0200 KVM: Remove write access permissions when dirty-page-logging is enabled Enabling dirty page logging is done using the KVM_SET_MEMORY_REGION ioctl. If the memory region already exists, we need to remove write accesses, so writes will be caught, and dirty pages will be logged. Signed-off-by: Uri Lublin Signed-off-by: Avi Kivity commit b77fd1f62576463434fc434cbdcd808847e169a1 Author: Uri Lublin Date: Thu Feb 22 17:15:33 2007 +0200 kvm: move do_remove_write_access() up To be called from kvm_vm_ioctl_set_memory_region() Signed-off-by: Uri Lublin Signed-off-by: Avi Kivity commit 62e287e7210d6ff142b3b05233fa1f5df686b794 Author: Uri Lublin Date: Thu Feb 22 16:43:09 2007 +0200 KVM: Fix dirty page log bitmap size/access calculation Since dirty_bitmap is an unsigned long array, the alignment and size need to take that into account. Signed-off-by: Uri Lublin Signed-off-by: Avi Kivity commit 871574eb14e959c19d94fdee7c3e2b88ae06770f Author: Uri Lublin Date: Wed Feb 21 18:25:21 2007 +0200 KVM: Add missing calls to mark_page_dirty() A few places where we modify guest memory fail to call mark_page_dirty(), causing live migration to fail. This adds the missing calls. Signed-off-by: Uri Lublin Signed-off-by: Avi Kivity commit 42017e8bf8eb7b6f65b95bca1368ee274fc5ef50 Author: Uri Lublin Date: Thu Feb 22 17:37:32 2007 +0200 kvm: dirty page logging: remove write access permissions when dirty-page-logging is enabled Enabling dirty page logging is done using the KVM_SET_MEMORY_REGION ioctl. If the memory region already exists, there is a need to remove write accesses, so writes will be caught, and dirty pages will be logged. commit a9fd29cfcb643b97cd76c7d836be4d0ed80f69e0 Author: Uri Lublin Date: Thu Feb 22 17:15:33 2007 +0200 kvm: move do_remove_write_access() up To be called from kvm_vm_ioctl_set_memory_region() commit fba4ba9c513ad2cd328f5f16980aa7b90d40cec0 Author: Uri Lublin Date: Thu Feb 22 16:43:09 2007 +0200 kvm: dirty pages log: fix bitmap size/access calculation Since dirty_bitmap is an unsigned long array (pointer), the alignment and size need to take that into account. commit ae160d732685ab33d5a3a495663aa2b54c4d4734 Author: Uri Lublin Date: Thu Feb 22 15:47:42 2007 +0200 .gitignore: ignore emacs backup files (*~) commit 8267c1cd9a8a038e91c94e0cabc571a3614dc3e5 Author: Avi Kivity Date: Wed Feb 21 19:47:40 2007 +0200 KVM: Bump API version Signed-off-by: Avi Kivity commit c65237e78c19b8173338a49933c611dece13c1c6 Author: Avi Kivity Date: Wed Feb 21 18:04:26 2007 +0200 KVM: Per-vcpu inodes Allocate a distinct inode for every vcpu in a VM. This has the following benefits: - the filp cachelines are no longer bounced when f_count is incremented on every ioctl() - the API and internal code are distinctly clearer; for example, on the KVM_GET_REGS ioctl, there is no need to copy the vcpu number from userspace and then copy the registers back; the vcpu identity is derived from the fd used to make the call Right now the performance benefits are completely theoretical since (a) we don't support more than one vcpu per VM and (b) virtualization hardware inefficiencies completely overwhelm any cacheline bouncing effects. But both of these will change, and we need to prepare the API today. Signed-off-by: Avi Kivity commit 11c1297fadc533d1f66252088b4f4775018bafbb Author: Avi Kivity Date: Tue Feb 20 18:41:05 2007 +0200 KVM: Move kvm_vm_ioctl_create_vcpu() around In preparation for some hacking. 
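The fd-derived vcpu identity described in the per-vcpu inodes commit above can be pictured as follows — a sketch assuming the vcpu pointer is stashed in the file's private_data, with illustrative names rather than the actual kvm handler (struct kvm_vcpu is the driver's internal type):

#include <linux/fs.h>
#include <linux/kvm.h>	/* KVM_GET_REGS */

static long example_vcpu_ioctl(struct file *filp, unsigned int ioctl,
			       unsigned long arg)
{
	/* the vcpu is recovered from the fd itself */
	struct kvm_vcpu *vcpu = filp->private_data;

	switch (ioctl) {
	case KVM_GET_REGS:
		/* operate directly on 'vcpu'; no vcpu number travels
		 * in the ioctl payload */
		break;
	default:
		break;
	}
	return 0;
}
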
Signed-off-by: Avi Kivity commit f3ad84386727171d8308338a2c5dee1deac2e50d Author: Avi Kivity Date: Tue Feb 20 18:27:58 2007 +0200 KVM: Rename some kvm_dev_ioctl_*() functions to kvm_vm_ioctl_*() This reflects the changed scope, from device-wide to single vm (previously every device open created a virtual machine). Signed-off-by: Avi Kivity commit 733e3f74f1c51bbc2e7a99df8b51767504b58de2 Author: Avi Kivity Date: Wed Feb 21 19:28:04 2007 +0200 KVM: Create an inode per virtual machine This avoids having filp->f_op and the corresponding inode->i_fop different, which is a little unorthodox. The ioctl list is split into two: global kvm ioctls and per-vm ioctls. A new ioctl, KVM_CREATE_VM, is used to create VMs and return the VM fd. Signed-off-by: Avi Kivity commit 52a96114380f8ab615626e4cec57b7015895bd0f Author: Avi Kivity Date: Tue Feb 20 14:07:37 2007 +0200 KVM: Add internal filesystem for generating inodes The kvmfs inodes will represent virtual machines and vcpus, as necessary, reducing cacheline bouncing due to inodes and filps being shared. Signed-off-by: Avi Kivity commit b00bc8b10197715f5b842f1f9a60e67a3484b10f Author: Uri Lublin Date: Wed Feb 21 18:25:21 2007 +0200 kvm, dirty pages log: adding some calls to mark_page_dirty() commit 58a214eba321d92f833221c26777e2119e34a19d Author: Avi Kivity Date: Mon Feb 19 14:37:48 2007 +0200 KVM: More 0 -> NULL conversions Signed-off-by: Avi Kivity commit f73199bb57b4c8feb7d8f60c6f1a25107de18dab Author: Joerg Roedel Date: Mon Feb 19 14:37:47 2007 +0200 KVM: SVM: intercept SMI to handle it at host level This patch changes the SVM code to intercept SMIs and handle them outside the guest. Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit fa2742c78f10fad8682e3af17df3e9fc2eece9e4 Author: Avi Kivity Date: Mon Feb 19 14:37:47 2007 +0200 KVM: svm: init cr0 with the wp bit set Signed-off-by: Avi Kivity commit 8da588a919dc0bef76e384d16fd13ea2189aa82d Author: Avi Kivity Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Wire up hypercall handlers to a central arch-independent location Signed-off-by: Avi Kivity commit 68f16784f188d280c75b39e2367ebc1adbc66d9d Author: Avi Kivity Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Add hypercall host support for svm Signed-off-by: Avi Kivity commit 7c8bd4d6fc0e2bfb35cd4c0e8ff39c4f8972d951 Author: Ingo Molnar Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Add host hypercall support for vmx Signed-off-by: Avi Kivity commit f846fa34a14ec37dc0194c6f47ea4374c140e6f1 Author: Ingo Molnar Date: Mon Feb 19 14:37:47 2007 +0200 KVM: add MSR based hypercall API This adds a special MSR-based hypercall API to KVM. This is to be used by paravirtual kernels and virtual drivers. Signed-off-by: Ingo Molnar Signed-off-by: Avi Kivity commit 8aa04bb13cf90d68c26d6bea1e4c720f1f027be0 Author: Markus Rechberger Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Use page_private()/set_page_private() apis Besides using an established api, this allows using kvm in older kernels. Signed-off-by: Markus Rechberger Signed-off-by: Avi Kivity commit 4d5a7e81cc63d28e94373cdeb74dc44045edaa10 Author: Ahmed S. Darwish Date: Mon Feb 19 14:37:46 2007 +0200 KVM: Use ARRAY_SIZE macro instead of manual calculation. Signed-off-by: Ahmed S. Darwish Signed-off-by: Dor Laor Signed-off-by: Avi Kivity commit 0fe9875fb3f9946a6c1cef6f1b9a286edc8ee2b9 Author: Markus Rechberger Date: Mon Feb 19 14:37:46 2007 +0200 KVM: vmx: hack set_cr0_no_modeswitch() to actually do modeswitch From: Joerg Roedel The whole thing is rotten, but this allows vmx to boot with the guest reboot fix. 
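From the guest side, the MSR-based hypercall API added a few commits above can be pictured roughly like this — the MSR number, name, and calling convention here are placeholders, not what the commit actually defines:

#include <asm/msr.h>
#include <asm/page.h>

#define MSR_EXAMPLE_HYPERCALL	0x87654321	/* placeholder MSR number */

/* the host intercepts the wrmsr and services the request described
 * by the parameter block whose physical address is passed in */
static void example_msr_hypercall(void *params)
{
	wrmsrl(MSR_EXAMPLE_HYPERCALL, __pa(params));
}
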
Signed-off-by: Markus Rechberger Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit 7e6e2bbad7f5dbccb389ee6d79be661972b18b15 Author: Avi Kivity Date: Mon Feb 19 14:37:46 2007 +0200 KVM: Cosmetics Signed-off-by: Avi Kivity commit cc66daca849ca8c2900ba8cc7640de664296d36a Author: Jeremy Katz Date: Mon Feb 19 14:37:46 2007 +0200 KVM: Move virtualization deactivation from CPU_DEAD state to CPU_DOWN_PREPARE This gives it more chances of surviving suspend. Signed-off-by: Jeremy Katz Signed-off-by: Avi Kivity commit 2959cd13ecc1fbe1b2339937481844ff963f1e7f Author: Avi Kivity Date: Mon Feb 19 14:37:46 2007 +0200 KVM: mmu: add missing dirty page tracking cases We fail to mark a page dirty in three cases: - setting the accessed bit in a pte - setting the dirty bit in a pte - emulating a write into a pagetable This fix adds the missing cases. Signed-off-by: Avi Kivity drivers/kvm/Kconfig | 1 drivers/kvm/Makefile | 2 drivers/kvm/i8259.c | 450 ++++++++++ drivers/kvm/ioapic.c | 387 ++++++++ drivers/kvm/irq.c | 99 ++ drivers/kvm/irq.h | 165 +++ drivers/kvm/kvm.h | 221 ++--- drivers/kvm/kvm_main.c | 1705 ++++++++++++++++++++++-------------- drivers/kvm/kvm_svm.h | 3 drivers/kvm/lapic.c | 1058 ++++++++++++++++++++++ drivers/kvm/mmu.c | 161 ++- drivers/kvm/paging_tmpl.h | 141 ++- drivers/kvm/svm.c | 1064 +++++++++++----------- drivers/kvm/vmx.c | 1104 ++++++++++++++--------- drivers/kvm/vmx.h | 73 +- drivers/kvm/x86_emulate.c | 1528 ++++++++++++++++++++------------ drivers/kvm/x86_emulate.h | 63 + include/asm-i386/processor-flags.h | 2 include/linux/kvm.h | 108 ++ include/linux/kvm_para.h | 159 ++- 20 files changed, 5892 insertions(+), 2602 deletions(-) diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig index 0a419a0..8749fa4 100644 --- a/drivers/kvm/Kconfig +++ b/drivers/kvm/Kconfig @@ -17,6 +17,7 @@ if VIRTUALIZATION config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on X86 && EXPERIMENTAL + select PREEMPT_NOTIFIERS select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile index c0a789f..e5a8f4d 100644 --- a/drivers/kvm/Makefile +++ b/drivers/kvm/Makefile @@ -2,7 +2,7 @@ # # Makefile for Kernel-based Virtual Machine module # -kvm-objs := kvm_main.o mmu.o x86_emulate.o +kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o irq.o lapic.o ioapic.o obj-$(CONFIG_KVM) += kvm.o kvm-intel-objs = vmx.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c new file mode 100644 index 0000000..a679157 --- /dev/null +++ b/drivers/kvm/i8259.c @@ -0,0 +1,450 @@ +/* + * 8259 interrupt controller emulation + * + * Copyright (c) 2003-2004 Fabrice Bellard + * Copyright (c) 2007 Intel Corporation + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + * Authors: + * Yaozu (Eddie) Dong + * Port from Qemu. + */ +#include +#include "irq.h" + +/* + * set irq level. If an edge is detected, then the IRR is set to 1 + */ +static inline void pic_set_irq1(struct kvm_kpic_state *s, int irq, int level) +{ + int mask; + mask = 1 << irq; + if (s->elcr & mask) /* level triggered */ + if (level) { + s->irr |= mask; + s->last_irr |= mask; + } else { + s->irr &= ~mask; + s->last_irr &= ~mask; + } + else /* edge triggered */ + if (level) { + if ((s->last_irr & mask) == 0) + s->irr |= mask; + s->last_irr |= mask; + } else + s->last_irr &= ~mask; +} + +/* + * return the highest priority found in mask (highest = smallest + * number). Return 8 if no irq + */ +static inline int get_priority(struct kvm_kpic_state *s, int mask) +{ + int priority; + if (mask == 0) + return 8; + priority = 0; + while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0) + priority++; + return priority; +} + +/* + * return the pic wanted interrupt. return -1 if none + */ +static int pic_get_irq(struct kvm_kpic_state *s) +{ + int mask, cur_priority, priority; + + mask = s->irr & ~s->imr; + priority = get_priority(s, mask); + if (priority == 8) + return -1; + /* + * compute current priority. If special fully nested mode on the + * master, the IRQ coming from the slave is not taken into account + * for the priority computation. + */ + mask = s->isr; + if (s->special_fully_nested_mode && s == &s->pics_state->pics[0]) + mask &= ~(1 << 2); + cur_priority = get_priority(s, mask); + if (priority < cur_priority) + /* + * higher priority found: an irq should be generated + */ + return (priority + s->priority_add) & 7; + else + return -1; +} + +/* + * raise irq to CPU if necessary. 
must be called every time the active + * irq may change + */ +static void pic_update_irq(struct kvm_pic *s) +{ + int irq2, irq; + + irq2 = pic_get_irq(&s->pics[1]); + if (irq2 >= 0) { + /* + * if irq request by slave pic, signal master PIC + */ + pic_set_irq1(&s->pics[0], 2, 1); + pic_set_irq1(&s->pics[0], 2, 0); + } + irq = pic_get_irq(&s->pics[0]); + if (irq >= 0) + s->irq_request(s->irq_request_opaque, 1); + else + s->irq_request(s->irq_request_opaque, 0); +} + +void kvm_pic_update_irq(struct kvm_pic *s) +{ + pic_update_irq(s); +} + +void kvm_pic_set_irq(void *opaque, int irq, int level) +{ + struct kvm_pic *s = opaque; + + pic_set_irq1(&s->pics[irq >> 3], irq & 7, level); + pic_update_irq(s); +} + +/* + * acknowledge interrupt 'irq' + */ +static inline void pic_intack(struct kvm_kpic_state *s, int irq) +{ + if (s->auto_eoi) { + if (s->rotate_on_auto_eoi) + s->priority_add = (irq + 1) & 7; + } else + s->isr |= (1 << irq); + /* + * We don't clear a level sensitive interrupt here + */ + if (!(s->elcr & (1 << irq))) + s->irr &= ~(1 << irq); +} + +int kvm_pic_read_irq(struct kvm_pic *s) +{ + int irq, irq2, intno; + + irq = pic_get_irq(&s->pics[0]); + if (irq >= 0) { + pic_intack(&s->pics[0], irq); + if (irq == 2) { + irq2 = pic_get_irq(&s->pics[1]); + if (irq2 >= 0) + pic_intack(&s->pics[1], irq2); + else + /* + * spurious IRQ on slave controller + */ + irq2 = 7; + intno = s->pics[1].irq_base + irq2; + irq = irq2 + 8; + } else + intno = s->pics[0].irq_base + irq; + } else { + /* + * spurious IRQ on host controller + */ + irq = 7; + intno = s->pics[0].irq_base + irq; + } + pic_update_irq(s); + + return intno; +} + +static void pic_reset(void *opaque) +{ + struct kvm_kpic_state *s = opaque; + + s->last_irr = 0; + s->irr = 0; + s->imr = 0; + s->isr = 0; + s->priority_add = 0; + s->irq_base = 0; + s->read_reg_select = 0; + s->poll = 0; + s->special_mask = 0; + s->init_state = 0; + s->auto_eoi = 0; + s->rotate_on_auto_eoi = 0; + s->special_fully_nested_mode = 0; + s->init4 = 0; +} + +static void pic_ioport_write(void *opaque, u32 addr, u32 val) +{ + struct kvm_kpic_state *s = opaque; + int priority, cmd, irq; + + addr &= 1; + if (addr == 0) { + if (val & 0x10) { + pic_reset(s); /* init */ + /* + * deassert a pending interrupt + */ + s->pics_state->irq_request(s->pics_state-> + irq_request_opaque, 0); + s->init_state = 1; + s->init4 = val & 1; + if (val & 0x02) + printk(KERN_ERR "single mode not supported"); + if (val & 0x08) + printk(KERN_ERR + "level sensitive irq not supported"); + } else if (val & 0x08) { + if (val & 0x04) + s->poll = 1; + if (val & 0x02) + s->read_reg_select = val & 1; + if (val & 0x40) + s->special_mask = (val >> 5) & 1; + } else { + cmd = val >> 5; + switch (cmd) { + case 0: + case 4: + s->rotate_on_auto_eoi = cmd >> 2; + break; + case 1: /* end of interrupt */ + case 5: + priority = get_priority(s, s->isr); + if (priority != 8) { + irq = (priority + s->priority_add) & 7; + s->isr &= ~(1 << irq); + if (cmd == 5) + s->priority_add = (irq + 1) & 7; + pic_update_irq(s->pics_state); + } + break; + case 3: + irq = val & 7; + s->isr &= ~(1 << irq); + pic_update_irq(s->pics_state); + break; + case 6: + s->priority_add = (val + 1) & 7; + pic_update_irq(s->pics_state); + break; + case 7: + irq = val & 7; + s->isr &= ~(1 << irq); + s->priority_add = (irq + 1) & 7; + pic_update_irq(s->pics_state); + break; + default: + break; /* no operation */ + } + } + } else + switch (s->init_state) { + case 0: /* normal mode */ + s->imr = val; + pic_update_irq(s->pics_state); + break; + case 1: + 
s->irq_base = val & 0xf8; + s->init_state = 2; + break; + case 2: + if (s->init4) + s->init_state = 3; + else + s->init_state = 0; + break; + case 3: + s->special_fully_nested_mode = (val >> 4) & 1; + s->auto_eoi = (val >> 1) & 1; + s->init_state = 0; + break; + } +} + +static u32 pic_poll_read(struct kvm_kpic_state *s, u32 addr1) +{ + int ret; + + ret = pic_get_irq(s); + if (ret >= 0) { + if (addr1 >> 7) { + s->pics_state->pics[0].isr &= ~(1 << 2); + s->pics_state->pics[0].irr &= ~(1 << 2); + } + s->irr &= ~(1 << ret); + s->isr &= ~(1 << ret); + if (addr1 >> 7 || ret != 2) + pic_update_irq(s->pics_state); + } else { + ret = 0x07; + pic_update_irq(s->pics_state); + } + + return ret; +} + +static u32 pic_ioport_read(void *opaque, u32 addr1) +{ + struct kvm_kpic_state *s = opaque; + unsigned int addr; + int ret; + + addr = addr1; + addr &= 1; + if (s->poll) { + ret = pic_poll_read(s, addr1); + s->poll = 0; + } else + if (addr == 0) + if (s->read_reg_select) + ret = s->isr; + else + ret = s->irr; + else + ret = s->imr; + return ret; +} + +static void elcr_ioport_write(void *opaque, u32 addr, u32 val) +{ + struct kvm_kpic_state *s = opaque; + s->elcr = val & s->elcr_mask; +} + +static u32 elcr_ioport_read(void *opaque, u32 addr1) +{ + struct kvm_kpic_state *s = opaque; + return s->elcr; +} + +static int picdev_in_range(struct kvm_io_device *this, gpa_t addr) +{ + switch (addr) { + case 0x20: + case 0x21: + case 0xa0: + case 0xa1: + case 0x4d0: + case 0x4d1: + return 1; + default: + return 0; + } +} + +static void picdev_write(struct kvm_io_device *this, + gpa_t addr, int len, const void *val) +{ + struct kvm_pic *s = this->private; + unsigned char data = *(unsigned char *)val; + + if (len != 1) { + if (printk_ratelimit()) + printk(KERN_ERR "PIC: non byte write\n"); + return; + } + switch (addr) { + case 0x20: + case 0x21: + case 0xa0: + case 0xa1: + pic_ioport_write(&s->pics[addr >> 7], addr, data); + break; + case 0x4d0: + case 0x4d1: + elcr_ioport_write(&s->pics[addr & 1], addr, data); + break; + } +} + +static void picdev_read(struct kvm_io_device *this, + gpa_t addr, int len, void *val) +{ + struct kvm_pic *s = this->private; + unsigned char data = 0; + + if (len != 1) { + if (printk_ratelimit()) + printk(KERN_ERR "PIC: non byte read\n"); + return; + } + switch (addr) { + case 0x20: + case 0x21: + case 0xa0: + case 0xa1: + data = pic_ioport_read(&s->pics[addr >> 7], addr); + break; + case 0x4d0: + case 0x4d1: + data = elcr_ioport_read(&s->pics[addr & 1], addr); + break; + } + *(unsigned char *)val = data; +} + +/* + * callback when PIC0 irq status changed + */ +static void pic_irq_request(void *opaque, int level) +{ + struct kvm *kvm = opaque; + struct kvm_vcpu *vcpu = kvm->vcpus[0]; + + pic_irqchip(kvm)->output = level; + if (vcpu) + kvm_vcpu_kick(vcpu); +} + +struct kvm_pic *kvm_create_pic(struct kvm *kvm) +{ + struct kvm_pic *s; + s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL); + if (!s) + return NULL; + s->pics[0].elcr_mask = 0xf8; + s->pics[1].elcr_mask = 0xde; + s->irq_request = pic_irq_request; + s->irq_request_opaque = kvm; + s->pics[0].pics_state = s; + s->pics[1].pics_state = s; + + /* + * Initialize PIO device + */ + s->dev.read = picdev_read; + s->dev.write = picdev_write; + s->dev.in_range = picdev_in_range; + s->dev.private = s; + kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev); + return s; +} diff --git a/drivers/kvm/ioapic.c b/drivers/kvm/ioapic.c new file mode 100644 index 0000000..9eb5058 --- /dev/null +++ b/drivers/kvm/ioapic.c @@ -0,0 +1,387 @@ +/* + * Copyright (C) 
2001 MandrakeSoft S.A. + * + * MandrakeSoft S.A. + * 43, rue d'Aboukir + * 75002 Paris - France + * http://www.linux-mandrake.com/ + * http://www.mandrakesoft.com/ + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Yunhong Jiang + * Yaozu (Eddie) Dong + * Based on Xen 3.1 code. + */ + +#include "kvm.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "irq.h" +/* #define ioapic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg) */ +#define ioapic_debug(fmt, arg...) +static void ioapic_deliver(struct kvm_ioapic *vioapic, int irq); + +static unsigned long ioapic_read_indirect(struct kvm_ioapic *ioapic, + unsigned long addr, + unsigned long length) +{ + unsigned long result = 0; + + switch (ioapic->ioregsel) { + case IOAPIC_REG_VERSION: + result = ((((IOAPIC_NUM_PINS - 1) & 0xff) << 16) + | (IOAPIC_VERSION_ID & 0xff)); + break; + + case IOAPIC_REG_APIC_ID: + case IOAPIC_REG_ARB_ID: + result = ((ioapic->id & 0xf) << 24); + break; + + default: + { + u32 redir_index = (ioapic->ioregsel - 0x10) >> 1; + u64 redir_content; + + ASSERT(redir_index < IOAPIC_NUM_PINS); + + redir_content = ioapic->redirtbl[redir_index].bits; + result = (ioapic->ioregsel & 0x1) ? + (redir_content >> 32) & 0xffffffff : + redir_content & 0xffffffff; + break; + } + } + + return result; +} + +static void ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx) +{ + union ioapic_redir_entry *pent; + + pent = &ioapic->redirtbl[idx]; + + if (!pent->fields.mask) { + ioapic_deliver(ioapic, idx); + if (pent->fields.trig_mode == IOAPIC_LEVEL_TRIG) + pent->fields.remote_irr = 1; + } + if (!pent->fields.trig_mode) + ioapic->irr &= ~(1 << idx); +} + +static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val) +{ + int index; + + switch (ioapic->ioregsel) { + case IOAPIC_REG_VERSION: + /* Writes are ignored. 
*/ + break; + + case IOAPIC_REG_APIC_ID: + ioapic->id = (val >> 24) & 0xf; + break; + + case IOAPIC_REG_ARB_ID: + break; + + default: + index = (ioapic->ioregsel - 0x10) >> 1; + + ioapic_debug("change redir index %x val %x", index, val); + ASSERT(irq < IOAPIC_NUM_PINS); + if (ioapic->ioregsel & 1) { + ioapic->redirtbl[index].bits &= 0xffffffff; + ioapic->redirtbl[index].bits |= (u64) val << 32; + } else { + ioapic->redirtbl[index].bits &= ~0xffffffffULL; + ioapic->redirtbl[index].bits |= (u32) val; + ioapic->redirtbl[index].fields.remote_irr = 0; + } + if (ioapic->irr & (1 << index)) + ioapic_service(ioapic, index); + break; + } +} + +static void ioapic_inj_irq(struct kvm_ioapic *ioapic, + struct kvm_lapic *target, + u8 vector, u8 trig_mode, u8 delivery_mode) +{ + ioapic_debug("irq %d trig %d deliv %d", vector, trig_mode, + delivery_mode); + + ASSERT((delivery_mode == dest_Fixed) || + (delivery_mode == dest_LowestPrio)); + + kvm_apic_set_irq(target, vector, trig_mode); +} + +static u32 ioapic_get_delivery_bitmask(struct kvm_ioapic *ioapic, u8 dest, + u8 dest_mode) +{ + u32 mask = 0; + int i; + struct kvm *kvm = ioapic->kvm; + struct kvm_vcpu *vcpu; + + ioapic_debug("dest %d dest_mode %d", dest, dest_mode); + + if (dest_mode == 0) { /* Physical mode. */ + if (dest == 0xFF) { /* Broadcast. */ + for (i = 0; i < KVM_MAX_VCPUS; ++i) + if (kvm->vcpus[i] && kvm->vcpus[i]->apic) + mask |= 1 << i; + return mask; + } + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + vcpu = kvm->vcpus[i]; + if (!vcpu) + continue; + if (kvm_apic_match_physical_addr(vcpu->apic, dest)) { + if (vcpu->apic) + mask = 1 << i; + break; + } + } + } else if (dest != 0) /* Logical mode, MDA non-zero. */ + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + vcpu = kvm->vcpus[i]; + if (!vcpu) + continue; + if (vcpu->apic && + kvm_apic_match_logical_addr(vcpu->apic, dest)) + mask |= 1 << vcpu->vcpu_id; + } + ioapic_debug("mask %x", mask); + return mask; +} + +static void ioapic_deliver(struct kvm_ioapic *ioapic, int irq) +{ + u8 dest = ioapic->redirtbl[irq].fields.dest_id; + u8 dest_mode = ioapic->redirtbl[irq].fields.dest_mode; + u8 delivery_mode = ioapic->redirtbl[irq].fields.delivery_mode; + u8 vector = ioapic->redirtbl[irq].fields.vector; + u8 trig_mode = ioapic->redirtbl[irq].fields.trig_mode; + u32 deliver_bitmask; + struct kvm_lapic *target; + struct kvm_vcpu *vcpu; + int vcpu_id; + + ioapic_debug("dest=%x dest_mode=%x delivery_mode=%x " + "vector=%x trig_mode=%x", + dest, dest_mode, delivery_mode, vector, trig_mode); + + deliver_bitmask = ioapic_get_delivery_bitmask(ioapic, dest, dest_mode); + if (!deliver_bitmask) { + ioapic_debug("no target on destination"); + return; + } + + switch (delivery_mode) { + case dest_LowestPrio: + target = + kvm_apic_round_robin(ioapic->kvm, vector, deliver_bitmask); + if (target != NULL) + ioapic_inj_irq(ioapic, target, vector, + trig_mode, delivery_mode); + else + ioapic_debug("null round robin: " + "mask=%x vector=%x delivery_mode=%x", + deliver_bitmask, vector, dest_LowestPrio); + break; + case dest_Fixed: + for (vcpu_id = 0; deliver_bitmask != 0; vcpu_id++) { + if (!(deliver_bitmask & (1 << vcpu_id))) + continue; + deliver_bitmask &= ~(1 << vcpu_id); + vcpu = ioapic->kvm->vcpus[vcpu_id]; + if (vcpu) { + target = vcpu->apic; + ioapic_inj_irq(ioapic, target, vector, + trig_mode, delivery_mode); + } + } + break; + + /* TODO: NMI */ + default: + printk(KERN_WARNING "Unsupported delivery mode %d\n", + delivery_mode); + break; + } +} + +void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +{ + 
u32 old_irr = ioapic->irr; + u32 mask = 1 << irq; + union ioapic_redir_entry entry; + + if (irq >= 0 && irq < IOAPIC_NUM_PINS) { + entry = ioapic->redirtbl[irq]; + level ^= entry.fields.polarity; + if (!level) + ioapic->irr &= ~mask; + else { + ioapic->irr |= mask; + if ((!entry.fields.trig_mode && old_irr != ioapic->irr) + || !entry.fields.remote_irr) + ioapic_service(ioapic, irq); + } + } +} + +static int get_eoi_gsi(struct kvm_ioapic *ioapic, int vector) +{ + int i; + + for (i = 0; i < IOAPIC_NUM_PINS; i++) + if (ioapic->redirtbl[i].fields.vector == vector) + return i; + return -1; +} + +void kvm_ioapic_update_eoi(struct kvm *kvm, int vector) +{ + struct kvm_ioapic *ioapic = kvm->vioapic; + union ioapic_redir_entry *ent; + int gsi; + + gsi = get_eoi_gsi(ioapic, vector); + if (gsi == -1) { + printk(KERN_WARNING "Can't find redir item for %d EOI\n", + vector); + return; + } + + ent = &ioapic->redirtbl[gsi]; + ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG); + + ent->fields.remote_irr = 0; + if (!ent->fields.mask && (ioapic->irr & (1 << gsi))) + ioapic_deliver(ioapic, gsi); +} + +static int ioapic_in_range(struct kvm_io_device *this, gpa_t addr) +{ + struct kvm_ioapic *ioapic = (struct kvm_ioapic *)this->private; + + return ((addr >= ioapic->base_address && + (addr < ioapic->base_address + IOAPIC_MEM_LENGTH))); +} + +static void ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len, + void *val) +{ + struct kvm_ioapic *ioapic = (struct kvm_ioapic *)this->private; + u32 result; + + ioapic_debug("addr %lx", (unsigned long)addr); + ASSERT(!(addr & 0xf)); /* check alignment */ + + addr &= 0xff; + switch (addr) { + case IOAPIC_REG_SELECT: + result = ioapic->ioregsel; + break; + + case IOAPIC_REG_WINDOW: + result = ioapic_read_indirect(ioapic, addr, len); + break; + + default: + result = 0; + break; + } + switch (len) { + case 8: + *(u64 *) val = result; + break; + case 1: + case 2: + case 4: + memcpy(val, (char *)&result, len); + break; + default: + printk(KERN_WARNING "ioapic: wrong length %d\n", len); + } +} + +static void ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len, + const void *val) +{ + struct kvm_ioapic *ioapic = (struct kvm_ioapic *)this->private; + u32 data; + + ioapic_debug("ioapic_mmio_write addr=%lx len=%d val=%p\n", + addr, len, val); + ASSERT(!(addr & 0xf)); /* check alignment */ + if (len == 4 || len == 8) + data = *(u32 *) val; + else { + printk(KERN_WARNING "ioapic: Unsupported size %d\n", len); + return; + } + + addr &= 0xff; + switch (addr) { + case IOAPIC_REG_SELECT: + ioapic->ioregsel = data; + break; + + case IOAPIC_REG_WINDOW: + ioapic_write_indirect(ioapic, data); + break; + + default: + break; + } +} + +int kvm_ioapic_init(struct kvm *kvm) +{ + struct kvm_ioapic *ioapic; + int i; + + ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL); + if (!ioapic) + return -ENOMEM; + kvm->vioapic = ioapic; + for (i = 0; i < IOAPIC_NUM_PINS; i++) + ioapic->redirtbl[i].fields.mask = 1; + ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS; + ioapic->dev.read = ioapic_mmio_read; + ioapic->dev.write = ioapic_mmio_write; + ioapic->dev.in_range = ioapic_in_range; + ioapic->dev.private = ioapic; + ioapic->kvm = kvm; + kvm_io_bus_register_dev(&kvm->mmio_bus, &ioapic->dev); + return 0; +} diff --git a/drivers/kvm/irq.c b/drivers/kvm/irq.c new file mode 100644 index 0000000..0f663fe --- /dev/null +++ b/drivers/kvm/irq.c @@ -0,0 +1,99 @@ +/* + * irq.c: API for in kernel interrupt controller + * Copyright (c) 2007, Intel Corporation. 
+ * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + * Authors: + * Yaozu (Eddie) Dong + * + */ + +#include + +#include "kvm.h" +#include "irq.h" + +/* + * check if there is pending interrupt without + * intack. + */ +int kvm_cpu_has_interrupt(struct kvm_vcpu *v) +{ + struct kvm_pic *s; + + if (kvm_apic_has_interrupt(v) == -1) { /* LAPIC */ + if (kvm_apic_accept_pic_intr(v)) { + s = pic_irqchip(v->kvm); /* PIC */ + return s->output; + } else + return 0; + } + return 1; +} +EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt); + +/* + * Read pending interrupt vector and intack. + */ +int kvm_cpu_get_interrupt(struct kvm_vcpu *v) +{ + struct kvm_pic *s; + int vector; + + vector = kvm_get_apic_interrupt(v); /* APIC */ + if (vector == -1) { + if (kvm_apic_accept_pic_intr(v)) { + s = pic_irqchip(v->kvm); + s->output = 0; /* PIC */ + vector = kvm_pic_read_irq(s); + } + } + return vector; +} +EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt); + +static void vcpu_kick_intr(void *info) +{ +#ifdef DEBUG + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)info; + printk(KERN_DEBUG "vcpu_kick_intr %p \n", vcpu); +#endif +} + +void kvm_vcpu_kick(struct kvm_vcpu *vcpu) +{ + int ipi_pcpu = vcpu->cpu; + + if (waitqueue_active(&vcpu->wq)) { + wake_up_interruptible(&vcpu->wq); + ++vcpu->stat.halt_wakeup; + } + if (vcpu->guest_mode) + smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0, 0); +} + +void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu) +{ + kvm_inject_apic_timer_irqs(vcpu); + /* TODO: PIT, RTC etc. */ +} +EXPORT_SYMBOL_GPL(kvm_inject_pending_timer_irqs); + +void kvm_timer_intr_post(struct kvm_vcpu *vcpu, int vec) +{ + kvm_apic_timer_intr_post(vcpu, vec); + /* TODO: PIT, RTC etc. */ +} +EXPORT_SYMBOL_GPL(kvm_timer_intr_post); + diff --git a/drivers/kvm/irq.h b/drivers/kvm/irq.h new file mode 100644 index 0000000..11fc014 --- /dev/null +++ b/drivers/kvm/irq.h @@ -0,0 +1,165 @@ +/* + * irq.h: in kernel interrupt controller related definitions + * Copyright (c) 2007, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. 
+ * Authors: + * Yaozu (Eddie) Dong + * + */ + +#ifndef __IRQ_H +#define __IRQ_H + +#include "kvm.h" + +typedef void irq_request_func(void *opaque, int level); + +struct kvm_kpic_state { + u8 last_irr; /* edge detection */ + u8 irr; /* interrupt request register */ + u8 imr; /* interrupt mask register */ + u8 isr; /* interrupt service register */ + u8 priority_add; /* highest irq priority */ + u8 irq_base; + u8 read_reg_select; + u8 poll; + u8 special_mask; + u8 init_state; + u8 auto_eoi; + u8 rotate_on_auto_eoi; + u8 special_fully_nested_mode; + u8 init4; /* true if 4 byte init */ + u8 elcr; /* PIIX edge/trigger selection */ + u8 elcr_mask; + struct kvm_pic *pics_state; +}; + +struct kvm_pic { + struct kvm_kpic_state pics[2]; /* 0 is master pic, 1 is slave pic */ + irq_request_func *irq_request; + void *irq_request_opaque; + int output; /* intr from master PIC */ + struct kvm_io_device dev; +}; + +struct kvm_pic *kvm_create_pic(struct kvm *kvm); +void kvm_pic_set_irq(void *opaque, int irq, int level); +int kvm_pic_read_irq(struct kvm_pic *s); +int kvm_cpu_get_interrupt(struct kvm_vcpu *v); +int kvm_cpu_has_interrupt(struct kvm_vcpu *v); +void kvm_pic_update_irq(struct kvm_pic *s); + +#define IOAPIC_NUM_PINS KVM_IOAPIC_NUM_PINS +#define IOAPIC_VERSION_ID 0x11 /* IOAPIC version */ +#define IOAPIC_EDGE_TRIG 0 +#define IOAPIC_LEVEL_TRIG 1 + +#define IOAPIC_DEFAULT_BASE_ADDRESS 0xfec00000 +#define IOAPIC_MEM_LENGTH 0x100 + +/* Direct registers. */ +#define IOAPIC_REG_SELECT 0x00 +#define IOAPIC_REG_WINDOW 0x10 +#define IOAPIC_REG_EOI 0x40 /* IA64 IOSAPIC only */ + +/* Indirect registers. */ +#define IOAPIC_REG_APIC_ID 0x00 /* x86 IOAPIC only */ +#define IOAPIC_REG_VERSION 0x01 +#define IOAPIC_REG_ARB_ID 0x02 /* x86 IOAPIC only */ + +struct kvm_ioapic { + u64 base_address; + u32 ioregsel; + u32 id; + u32 irr; + u32 pad; + union ioapic_redir_entry { + u64 bits; + struct { + u8 vector; + u8 delivery_mode:3; + u8 dest_mode:1; + u8 delivery_status:1; + u8 polarity:1; + u8 remote_irr:1; + u8 trig_mode:1; + u8 mask:1; + u8 reserve:7; + u8 reserved[4]; + u8 dest_id; + } fields; + } redirtbl[IOAPIC_NUM_PINS]; + struct kvm_io_device dev; + struct kvm *kvm; +}; + +struct kvm_lapic { + unsigned long base_address; + struct kvm_io_device dev; + struct { + atomic_t pending; + s64 period; /* unit: ns */ + u32 divide_count; + ktime_t last_update; + struct hrtimer dev; + } timer; + struct kvm_vcpu *vcpu; + struct page *regs_page; + void *regs; +}; + +#ifdef DEBUG +#define ASSERT(x) \ +do { \ + if (!(x)) { \ + printk(KERN_EMERG "assertion failed %s: %d: %s\n", \ + __FILE__, __LINE__, #x); \ + BUG(); \ + } \ +} while (0) +#else +#define ASSERT(x) do { } while (0) +#endif + +void kvm_vcpu_kick(struct kvm_vcpu *vcpu); +int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu); +int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu); +int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); +int kvm_create_lapic(struct kvm_vcpu *vcpu); +void kvm_lapic_reset(struct kvm_vcpu *vcpu); +void kvm_free_apic(struct kvm_lapic *apic); +u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); +void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); +void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); +struct kvm_lapic *kvm_apic_round_robin(struct kvm *kvm, u8 vector, + unsigned long bitmap); +u64 kvm_get_apic_base(struct kvm_vcpu *vcpu); +void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data); +int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest); +void kvm_ioapic_update_eoi(struct kvm *kvm, int vector); +int 
kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda); +int kvm_apic_set_irq(struct kvm_lapic *apic, u8 vec, u8 trig); +void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu); +int kvm_ioapic_init(struct kvm *kvm); +void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level); +int kvm_lapic_enabled(struct kvm_vcpu *vcpu); +int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu); +void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec); +void kvm_timer_intr_post(struct kvm_vcpu *vcpu, int vec); +void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu); +void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu); +void kvm_migrate_apic_timer(struct kvm_vcpu *vcpu); + +#endif diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index 336be86..051cdbe 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -13,61 +13,40 @@ #include #include #include #include +#include #include -#include "vmx.h" #include #include -#define CR0_PE_MASK (1ULL << 0) -#define CR0_MP_MASK (1ULL << 1) -#define CR0_TS_MASK (1ULL << 3) -#define CR0_NE_MASK (1ULL << 5) -#define CR0_WP_MASK (1ULL << 16) -#define CR0_NW_MASK (1ULL << 29) -#define CR0_CD_MASK (1ULL << 30) -#define CR0_PG_MASK (1ULL << 31) - -#define CR3_WPT_MASK (1ULL << 3) -#define CR3_PCD_MASK (1ULL << 4) - -#define CR3_RESEVED_BITS 0x07ULL -#define CR3_L_MODE_RESEVED_BITS (~((1ULL << 40) - 1) | 0x0fe7ULL) -#define CR3_FLAGS_MASK ((1ULL << 5) - 1) - -#define CR4_VME_MASK (1ULL << 0) -#define CR4_PSE_MASK (1ULL << 4) -#define CR4_PAE_MASK (1ULL << 5) -#define CR4_PGE_MASK (1ULL << 7) -#define CR4_VMXE_MASK (1ULL << 13) +#define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1) +#define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD)) +#define CR3_L_MODE_RESERVED_BITS (CR3_NONPAE_RESERVED_BITS|0xFFFFFF0000000000ULL) #define KVM_GUEST_CR0_MASK \ - (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK \ - | CR0_NW_MASK | CR0_CD_MASK) + (X86_CR0_PG | X86_CR0_PE | X86_CR0_WP | X86_CR0_NE \ + | X86_CR0_NW | X86_CR0_CD) #define KVM_VM_CR0_ALWAYS_ON \ - (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK | CR0_TS_MASK \ - | CR0_MP_MASK) + (X86_CR0_PG | X86_CR0_PE | X86_CR0_WP | X86_CR0_NE | X86_CR0_TS \ + | X86_CR0_MP) #define KVM_GUEST_CR4_MASK \ - (CR4_PSE_MASK | CR4_PAE_MASK | CR4_PGE_MASK | CR4_VMXE_MASK | CR4_VME_MASK) -#define KVM_PMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK) -#define KVM_RMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK | CR4_VME_MASK) + (X86_CR4_VME | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_PGE | X86_CR4_VMXE) +#define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE) +#define KVM_RMODE_VM_CR4_ALWAYS_ON (X86_CR4_VME | X86_CR4_PAE | X86_CR4_VMXE) #define INVALID_PAGE (~(hpa_t)0) #define UNMAPPED_GVA (~(gpa_t)0) #define KVM_MAX_VCPUS 4 #define KVM_ALIAS_SLOTS 4 -#define KVM_MEMORY_SLOTS 4 +#define KVM_MEMORY_SLOTS 8 #define KVM_NUM_MMU_PAGES 1024 #define KVM_MIN_FREE_MMU_PAGES 5 #define KVM_REFILL_PAGES 25 #define KVM_MAX_CPUID_ENTRIES 40 -#define FX_IMAGE_SIZE 512 -#define FX_IMAGE_ALIGN 16 -#define FX_BUF_SIZE (2 * FX_IMAGE_SIZE + FX_IMAGE_ALIGN) - #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -158,15 +137,8 @@ struct kvm_mmu_page { }; }; -struct vmcs { - u32 revision_id; - u32 abort; - char data[0]; -}; - -#define vmx_msr_entry kvm_msr_entry - struct kvm_vcpu; +extern struct kmem_cache *kvm_vcpu_cache; /* * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level @@ -178,6 +150,8 @@ struct kvm_mmu { 
int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err); void (*free)(struct kvm_vcpu *vcpu); gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva); + void (*prefetch_page)(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *page); hpa_t root_hpa; int root_level; int shadow_root_level; @@ -235,6 +209,8 @@ enum { VCPU_SREG_LDTR, }; +#include "x86_emulate.h" + struct kvm_pio_request { unsigned long count; int cur_count; @@ -260,6 +236,7 @@ struct kvm_stat { u32 signal_exits; u32 irq_window_exits; u32 halt_exits; + u32 halt_wakeup; u32 request_irq_exits; u32 irq_exits; u32 light_exits; @@ -328,44 +305,37 @@ void kvm_io_bus_register_dev(struct kvm_ struct kvm_vcpu { struct kvm *kvm; - union { - struct vmcs *vmcs; - struct vcpu_svm *svm; - }; + struct preempt_notifier preempt_notifier; + int vcpu_id; struct mutex mutex; int cpu; - int launched; u64 host_tsc; struct kvm_run *run; int interrupt_window_open; int guest_mode; unsigned long requests; unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */ -#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long) - unsigned long irq_pending[NR_IRQ_WORDS]; + DECLARE_BITMAP(irq_pending, KVM_NR_INTERRUPTS); unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */ unsigned long rip; /* needs vcpu_load_rsp_rip() */ unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ u64 shadow_efer; u64 apic_base; + struct kvm_lapic *apic; /* kernel irqchip context */ +#define VCPU_MP_STATE_RUNNABLE 0 +#define VCPU_MP_STATE_UNINITIALIZED 1 +#define VCPU_MP_STATE_INIT_RECEIVED 2 +#define VCPU_MP_STATE_SIPI_RECEIVED 3 +#define VCPU_MP_STATE_HALTED 4 + int mp_state; + int sipi_vector; u64 ia32_misc_enable_msr; - int nmsrs; - int save_nmsrs; - int msr_offset_efer; -#ifdef CONFIG_X86_64 - int msr_offset_kernel_gs_base; -#endif - struct vmx_msr_entry *guest_msrs; - struct vmx_msr_entry *host_msrs; struct kvm_mmu mmu; @@ -376,19 +346,14 @@ #endif gfn_t last_pt_write_gfn; int last_pt_write_count; + u64 *last_pte_updated; struct kvm_guest_debug guest_debug; - char fx_buf[FX_BUF_SIZE]; - char *host_fx_image; - char *guest_fx_image; + struct i387_fxsave_struct host_fx_image; + struct i387_fxsave_struct guest_fx_image; int fpu_active; int guest_fpu_loaded; - struct vmx_host_state { - int loaded; - u16 fs_sel, gs_sel, ldt_sel; - int fs_gs_ldt_reload_needed; - } vmx_host_state; int mmio_needed; int mmio_read_completed; @@ -399,6 +364,7 @@ #endif gva_t mmio_fault_cr2; struct kvm_pio_request pio; void *pio_data; + wait_queue_head_t wq; int sigset_active; sigset_t sigset; @@ -419,6 +385,10 @@ #endif int cpuid_nent; struct kvm_cpuid_entry cpuid_entries[KVM_MAX_CPUID_ENTRIES]; + + /* emulate context */ + + struct x86_emulate_ctxt emulate_ctxt; }; struct kvm_mem_alias { @@ -436,7 +406,7 @@ struct kvm_memory_slot { }; struct kvm { - spinlock_t lock; /* protects everything except vcpus */ + struct mutex lock; /* protects everything except vcpus */ int naliases; struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS]; int nmemslots; @@ -447,39 +417,59 @@ struct kvm { struct list_head active_mmu_pages; int n_free_mmu_pages; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; - int nvcpus; - struct kvm_vcpu vcpus[KVM_MAX_VCPUS]; - int memory_config_version; - int busy; + struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; unsigned long rmap_overflow; struct list_head vm_list; struct file *filp; struct kvm_io_bus mmio_bus; struct kvm_io_bus pio_bus; + 
struct kvm_pic *vpic; + struct kvm_ioapic *vioapic; + int round_robin_prev_vcpu; }; +static inline struct kvm_pic *pic_irqchip(struct kvm *kvm) +{ + return kvm->vpic; +} + +static inline struct kvm_ioapic *ioapic_irqchip(struct kvm *kvm) +{ + return kvm->vioapic; +} + +static inline int irqchip_in_kernel(struct kvm *kvm) +{ + return pic_irqchip(kvm) != 0; +} + struct descriptor_table { u16 limit; unsigned long base; } __attribute__((packed)); -struct kvm_arch_ops { +struct kvm_x86_ops { int (*cpu_has_kvm_support)(void); /* __init */ int (*disabled_by_bios)(void); /* __init */ void (*hardware_enable)(void *dummy); /* __init */ void (*hardware_disable)(void *dummy); + void (*check_processor_compatibility)(void *rtn); int (*hardware_setup)(void); /* __init */ void (*hardware_unsetup)(void); /* __exit */ - int (*vcpu_create)(struct kvm_vcpu *vcpu); + /* Create, but do not attach this VCPU */ + struct kvm_vcpu *(*vcpu_create)(struct kvm *kvm, unsigned id); void (*vcpu_free)(struct kvm_vcpu *vcpu); + void (*vcpu_reset)(struct kvm_vcpu *vcpu); - void (*vcpu_load)(struct kvm_vcpu *vcpu); + void (*prepare_guest_switch)(struct kvm_vcpu *vcpu); + void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); void (*vcpu_put)(struct kvm_vcpu *vcpu); void (*vcpu_decache)(struct kvm_vcpu *vcpu); int (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_debug_guest *dbg); + void (*guest_debug_pre)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); @@ -505,27 +495,43 @@ struct kvm_arch_ops { unsigned long (*get_rflags)(struct kvm_vcpu *vcpu); void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags); - void (*invlpg)(struct kvm_vcpu *vcpu, gva_t addr); void (*tlb_flush)(struct kvm_vcpu *vcpu); void (*inject_page_fault)(struct kvm_vcpu *vcpu, unsigned long addr, u32 err_code); void (*inject_gp)(struct kvm_vcpu *vcpu, unsigned err_code); - int (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run); - int (*vcpu_setup)(struct kvm_vcpu *vcpu); + void (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run); + int (*handle_exit)(struct kvm_run *run, struct kvm_vcpu *vcpu); void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu); void (*patch_hypercall)(struct kvm_vcpu *vcpu, unsigned char *hypercall_addr); + int (*get_irq)(struct kvm_vcpu *vcpu); + void (*set_irq)(struct kvm_vcpu *vcpu, int vec); + void (*inject_pending_irq)(struct kvm_vcpu *vcpu); + void (*inject_pending_vectors)(struct kvm_vcpu *vcpu, + struct kvm_run *run); }; -extern struct kvm_arch_ops *kvm_arch_ops; +extern struct kvm_x86_ops *kvm_x86_ops; + +/* The guest did something we don't support. */ +#define pr_unimpl(vcpu, fmt, ...) \ + do { \ + if (printk_ratelimit()) \ + printk(KERN_ERR "kvm: %i: cpu%i " fmt, \ + current->tgid, (vcpu)->vcpu_id , ## __VA_ARGS__); \ + } while(0) #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt) #define vcpu_printf(vcpu, fmt...) 
kvm_printf(vcpu->kvm, fmt) -int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module); -void kvm_exit_arch(void); +int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id); +void kvm_vcpu_uninit(struct kvm_vcpu *vcpu); + +int kvm_init_x86(struct kvm_x86_ops *ops, unsigned int vcpu_size, + struct module *module); +void kvm_exit_x86(void); int kvm_mmu_module_init(void); void kvm_mmu_module_exit(void); @@ -533,6 +539,7 @@ void kvm_mmu_module_exit(void); void kvm_mmu_destroy(struct kvm_vcpu *vcpu); int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); +void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); @@ -545,8 +552,6 @@ static inline int is_error_hpa(hpa_t hpa hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva); struct page *gva_to_page(struct kvm_vcpu *vcpu, gva_t gva); -void kvm_emulator_want_group7_invlpg(void); - extern hpa_t bad_page_address; struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); @@ -560,7 +565,8 @@ enum emulation_result { }; int emulate_instruction(struct kvm_vcpu *vcpu, struct kvm_run *run, - unsigned long cr2, u16 error_code); + unsigned long cr2, u16 error_code, int no_decode); +void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, @@ -574,9 +580,11 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u struct x86_emulate_ctxt; -int kvm_setup_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, - int size, unsigned long count, int string, int down, - gva_t address, int rep, unsigned port); +int kvm_emulate_pio (struct kvm_vcpu *vcpu, struct kvm_run *run, int in, + int size, unsigned port); +int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, + int size, unsigned long count, int down, + gva_t address, int rep, unsigned port); void kvm_emulate_cpuid(struct kvm_vcpu *vcpu); int kvm_emulate_halt(struct kvm_vcpu *vcpu); int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address); @@ -590,40 +598,41 @@ void set_cr0(struct kvm_vcpu *vcpu, unsi void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr0); void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr0); void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr0); +unsigned long get_cr8(struct kvm_vcpu *vcpu); void lmsw(struct kvm_vcpu *vcpu, unsigned long msw); +void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); void fx_init(struct kvm_vcpu *vcpu); -void load_msrs(struct vmx_msr_entry *e, int n); -void save_msrs(struct vmx_msr_entry *e, int n); void kvm_resched(struct kvm_vcpu *vcpu); void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); void kvm_flush_remote_tlbs(struct kvm *kvm); -int kvm_read_guest(struct kvm_vcpu *vcpu, - gva_t addr, - unsigned long size, - void *dest); - -int kvm_write_guest(struct kvm_vcpu *vcpu, - gva_t addr, - unsigned long size, - void *data); +int emulator_read_std(unsigned long addr, + void *val, + unsigned int bytes, + struct kvm_vcpu *vcpu); +int emulator_write_emulated(unsigned long addr, + const void *val, + unsigned int bytes, + struct kvm_vcpu *vcpu); unsigned long 
segment_base(u16 selector); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *old, const u8 *new, int bytes); + const u8 *new, int bytes); int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) @@ -656,17 +665,17 @@ #endif static inline int is_pae(struct kvm_vcpu *vcpu) { - return vcpu->cr4 & CR4_PAE_MASK; + return vcpu->cr4 & X86_CR4_PAE; } static inline int is_pse(struct kvm_vcpu *vcpu) { - return vcpu->cr4 & CR4_PSE_MASK; + return vcpu->cr4 & X86_CR4_PSE; } static inline int is_paging(struct kvm_vcpu *vcpu) { - return vcpu->cr0 & CR0_PG_MASK; + return vcpu->cr0 & X86_CR0_PG; } static inline int memslot_id(struct kvm *kvm, struct kvm_memory_slot *slot) @@ -746,12 +755,12 @@ static inline unsigned long read_msr(uns } #endif -static inline void fx_save(void *image) +static inline void fx_save(struct i387_fxsave_struct *image) { asm ("fxsave (%0)":: "r" (image)); } -static inline void fx_restore(void *image) +static inline void fx_restore(struct i387_fxsave_struct *image) { asm ("fxrstor (%0)":: "r" (image)); } diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index cd05579..60798e3 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -18,6 +18,7 @@ #include "kvm.h" #include "x86_emulate.h" #include "segment_descriptor.h" +#include "irq.h" #include #include @@ -37,6 +38,8 @@ #include #include #include #include +#include +#include #include #include @@ -52,9 +55,11 @@ static LIST_HEAD(vm_list); static cpumask_t cpus_hardware_enabled; -struct kvm_arch_ops *kvm_arch_ops; +struct kvm_x86_ops *kvm_x86_ops; +struct kmem_cache *kvm_vcpu_cache; +EXPORT_SYMBOL_GPL(kvm_vcpu_cache); -static void hardware_disable(void *ignored); +static __read_mostly struct preempt_ops kvm_preempt_ops; #define STAT_OFFSET(x) offsetof(struct kvm_vcpu, stat.x) @@ -73,6 +78,7 @@ static struct kvm_stats_debugfs_item { { "signal_exits", STAT_OFFSET(signal_exits) }, { "irq_window", STAT_OFFSET(irq_window_exits) }, { "halt_exits", STAT_OFFSET(halt_exits) }, + { "halt_wakeup", STAT_OFFSET(halt_wakeup) }, { "request_irq", STAT_OFFSET(request_irq_exits) }, { "irq_exits", STAT_OFFSET(irq_exits) }, { "light_exits", STAT_OFFSET(light_exits) }, @@ -84,10 +90,17 @@ static struct dentry *debugfs_dir; #define MAX_IO_MSRS 256 -#define CR0_RESEVED_BITS 0xffffffff1ffaffc0ULL -#define LMSW_GUEST_MASK 0x0eULL -#define CR4_RESEVED_BITS (~((1ULL << 11) - 1)) -#define CR8_RESEVED_BITS (~0x0fULL) +#define CR0_RESERVED_BITS \ + (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ + | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ + | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG)) +#define CR4_RESERVED_BITS \ + (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\ + | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \ + | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR \ + | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE)) + +#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR) #define EFER_RESERVED_BITS 0xfffffffffffff2fe #ifdef CONFIG_X86_64 @@ -139,82 +152,14 @@ static inline int valid_vcpu(int n) return likely(n >= 0 && n < KVM_MAX_VCPUS); } -int kvm_read_guest(struct kvm_vcpu *vcpu, gva_t addr, 
unsigned long size, - void *dest) -{ - unsigned char *host_buf = dest; - unsigned long req_size = size; - - while (size) { - hpa_t paddr; - unsigned now; - unsigned offset; - hva_t guest_buf; - - paddr = gva_to_hpa(vcpu, addr); - - if (is_error_hpa(paddr)) - break; - - guest_buf = (hva_t)kmap_atomic( - pfn_to_page(paddr >> PAGE_SHIFT), - KM_USER0); - offset = addr & ~PAGE_MASK; - guest_buf |= offset; - now = min(size, PAGE_SIZE - offset); - memcpy(host_buf, (void*)guest_buf, now); - host_buf += now; - addr += now; - size -= now; - kunmap_atomic((void *)(guest_buf & PAGE_MASK), KM_USER0); - } - return req_size - size; -} -EXPORT_SYMBOL_GPL(kvm_read_guest); - -int kvm_write_guest(struct kvm_vcpu *vcpu, gva_t addr, unsigned long size, - void *data) -{ - unsigned char *host_buf = data; - unsigned long req_size = size; - - while (size) { - hpa_t paddr; - unsigned now; - unsigned offset; - hva_t guest_buf; - gfn_t gfn; - - paddr = gva_to_hpa(vcpu, addr); - - if (is_error_hpa(paddr)) - break; - - gfn = vcpu->mmu.gva_to_gpa(vcpu, addr) >> PAGE_SHIFT; - mark_page_dirty(vcpu->kvm, gfn); - guest_buf = (hva_t)kmap_atomic( - pfn_to_page(paddr >> PAGE_SHIFT), KM_USER0); - offset = addr & ~PAGE_MASK; - guest_buf |= offset; - now = min(size, PAGE_SIZE - offset); - memcpy((void*)guest_buf, host_buf, now); - host_buf += now; - addr += now; - size -= now; - kunmap_atomic((void *)(guest_buf & PAGE_MASK), KM_USER0); - } - return req_size - size; -} -EXPORT_SYMBOL_GPL(kvm_write_guest); - void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) { if (!vcpu->fpu_active || vcpu->guest_fpu_loaded) return; vcpu->guest_fpu_loaded = 1; - fx_save(vcpu->host_fx_image); - fx_restore(vcpu->guest_fx_image); + fx_save(&vcpu->host_fx_image); + fx_restore(&vcpu->guest_fx_image); } EXPORT_SYMBOL_GPL(kvm_load_guest_fpu); @@ -224,8 +169,8 @@ void kvm_put_guest_fpu(struct kvm_vcpu * return; vcpu->guest_fpu_loaded = 0; - fx_save(vcpu->guest_fx_image); - fx_restore(vcpu->host_fx_image); + fx_save(&vcpu->guest_fx_image); + fx_restore(&vcpu->host_fx_image); } EXPORT_SYMBOL_GPL(kvm_put_guest_fpu); @@ -234,13 +179,21 @@ EXPORT_SYMBOL_GPL(kvm_put_guest_fpu); */ static void vcpu_load(struct kvm_vcpu *vcpu) { + int cpu; + mutex_lock(&vcpu->mutex); - kvm_arch_ops->vcpu_load(vcpu); + cpu = get_cpu(); + preempt_notifier_register(&vcpu->preempt_notifier); + kvm_x86_ops->vcpu_load(vcpu, cpu); + put_cpu(); } static void vcpu_put(struct kvm_vcpu *vcpu) { - kvm_arch_ops->vcpu_put(vcpu); + preempt_disable(); + kvm_x86_ops->vcpu_put(vcpu); + preempt_notifier_unregister(&vcpu->preempt_notifier); + preempt_enable(); mutex_unlock(&vcpu->mutex); } @@ -261,8 +214,10 @@ void kvm_flush_remote_tlbs(struct kvm *k atomic_set(&completed, 0); cpus_clear(cpus); needed = 0; - for (i = 0; i < kvm->nvcpus; ++i) { - vcpu = &kvm->vcpus[i]; + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + vcpu = kvm->vcpus[i]; + if (!vcpu) + continue; if (test_and_set_bit(KVM_TLB_FLUSH, &vcpu->requests)) continue; cpu = vcpu->cpu; @@ -286,37 +241,79 @@ void kvm_flush_remote_tlbs(struct kvm *k } } +int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id) +{ + struct page *page; + int r; + + mutex_init(&vcpu->mutex); + vcpu->cpu = -1; + vcpu->mmu.root_hpa = INVALID_PAGE; + vcpu->kvm = kvm; + vcpu->vcpu_id = id; + if (!irqchip_in_kernel(kvm) || id == 0) + vcpu->mp_state = VCPU_MP_STATE_RUNNABLE; + else + vcpu->mp_state = VCPU_MP_STATE_UNINITIALIZED; + init_waitqueue_head(&vcpu->wq); + + page = alloc_page(GFP_KERNEL | __GFP_ZERO); + if (!page) { + r = -ENOMEM; + goto fail; + } + vcpu->run = 
page_address(page); + + page = alloc_page(GFP_KERNEL | __GFP_ZERO); + if (!page) { + r = -ENOMEM; + goto fail_free_run; + } + vcpu->pio_data = page_address(page); + + r = kvm_mmu_create(vcpu); + if (r < 0) + goto fail_free_pio_data; + + return 0; + +fail_free_pio_data: + free_page((unsigned long)vcpu->pio_data); +fail_free_run: + free_page((unsigned long)vcpu->run); +fail: + return -ENOMEM; +} +EXPORT_SYMBOL_GPL(kvm_vcpu_init); + +void kvm_vcpu_uninit(struct kvm_vcpu *vcpu) +{ + kvm_mmu_destroy(vcpu); + if (vcpu->apic) + hrtimer_cancel(&vcpu->apic->timer.dev); + kvm_free_apic(vcpu->apic); + free_page((unsigned long)vcpu->pio_data); + free_page((unsigned long)vcpu->run); +} +EXPORT_SYMBOL_GPL(kvm_vcpu_uninit); + static struct kvm *kvm_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); - int i; if (!kvm) return ERR_PTR(-ENOMEM); kvm_io_bus_init(&kvm->pio_bus); - spin_lock_init(&kvm->lock); + mutex_init(&kvm->lock); INIT_LIST_HEAD(&kvm->active_mmu_pages); kvm_io_bus_init(&kvm->mmio_bus); - for (i = 0; i < KVM_MAX_VCPUS; ++i) { - struct kvm_vcpu *vcpu = &kvm->vcpus[i]; - - mutex_init(&vcpu->mutex); - vcpu->cpu = -1; - vcpu->kvm = kvm; - vcpu->mmu.root_hpa = INVALID_PAGE; - } spin_lock(&kvm_lock); list_add(&kvm->vm_list, &vm_list); spin_unlock(&kvm_lock); return kvm; } -static int kvm_dev_open(struct inode *inode, struct file *filp) -{ - return 0; -} - /* * Free any memory in @free but not in @dont. */ @@ -353,7 +350,7 @@ static void free_pio_guest_pages(struct { int i; - for (i = 0; i < 2; ++i) + for (i = 0; i < ARRAY_SIZE(vcpu->pio.guest_pages); ++i) if (vcpu->pio.guest_pages[i]) { __free_page(vcpu->pio.guest_pages[i]); vcpu->pio.guest_pages[i] = NULL; @@ -362,30 +359,11 @@ static void free_pio_guest_pages(struct static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu) { - if (!vcpu->vmcs) - return; - vcpu_load(vcpu); kvm_mmu_unload(vcpu); vcpu_put(vcpu); } -static void kvm_free_vcpu(struct kvm_vcpu *vcpu) -{ - if (!vcpu->vmcs) - return; - - vcpu_load(vcpu); - kvm_mmu_destroy(vcpu); - vcpu_put(vcpu); - kvm_arch_ops->vcpu_free(vcpu); - free_page((unsigned long)vcpu->run); - vcpu->run = NULL; - free_page((unsigned long)vcpu->pio_data); - vcpu->pio_data = NULL; - free_pio_guest_pages(vcpu); -} - static void kvm_free_vcpus(struct kvm *kvm) { unsigned int i; @@ -394,14 +372,15 @@ static void kvm_free_vcpus(struct kvm *k * Unpin any mmu pages first. 
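The two passes that follow exist, presumably, because shadow MMU pages are pooled per-VM: every vcpu's MMU root must be unloaded before any vcpu (and with it any shadow pages it still pins) is destroyed. A minimal stand-alone sketch of the pattern over the new sparse vcpu array, with stand-in types rather than the kernel's:

#include <stddef.h>

#define MAX_VCPUS 4

struct toy_vcpu { int id; };

static void unload_mmu(struct toy_vcpu *v) { /* drop root references */ }
static void destroy(struct toy_vcpu *v)    { /* free the vcpu itself */ }

static void free_all(struct toy_vcpu *vcpus[MAX_VCPUS])
{
        unsigned int i;

        /* Pass 1: every MMU context goes away before any vcpu does. */
        for (i = 0; i < MAX_VCPUS; ++i)
                if (vcpus[i])
                        unload_mmu(vcpus[i]);
        /* Pass 2: now each slot can be destroyed and cleared. */
        for (i = 0; i < MAX_VCPUS; ++i)
                if (vcpus[i]) {
                        destroy(vcpus[i]);
                        vcpus[i] = NULL;
                }
}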
*/ for (i = 0; i < KVM_MAX_VCPUS; ++i) - kvm_unload_vcpu_mmu(&kvm->vcpus[i]); - for (i = 0; i < KVM_MAX_VCPUS; ++i) - kvm_free_vcpu(&kvm->vcpus[i]); -} + if (kvm->vcpus[i]) + kvm_unload_vcpu_mmu(kvm->vcpus[i]); + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + if (kvm->vcpus[i]) { + kvm_x86_ops->vcpu_free(kvm->vcpus[i]); + kvm->vcpus[i] = NULL; + } + } -static int kvm_dev_release(struct inode *inode, struct file *filp) -{ - return 0; } static void kvm_destroy_vm(struct kvm *kvm) @@ -411,6 +390,8 @@ static void kvm_destroy_vm(struct kvm *k spin_unlock(&kvm_lock); kvm_io_bus_destroy(&kvm->pio_bus); kvm_io_bus_destroy(&kvm->mmio_bus); + kfree(kvm->vpic); + kfree(kvm->vioapic); kvm_free_vcpus(kvm); kvm_free_physmem(kvm); kfree(kvm); @@ -426,7 +407,7 @@ static int kvm_vm_release(struct inode * static void inject_gp(struct kvm_vcpu *vcpu) { - kvm_arch_ops->inject_gp(vcpu, 0); + kvm_x86_ops->inject_gp(vcpu, 0); } /* @@ -437,58 +418,60 @@ static int load_pdptrs(struct kvm_vcpu * gfn_t pdpt_gfn = cr3 >> PAGE_SHIFT; unsigned offset = ((cr3 & (PAGE_SIZE-1)) >> 5) << 2; int i; - u64 pdpte; u64 *pdpt; int ret; struct page *page; + u64 pdpte[ARRAY_SIZE(vcpu->pdptrs)]; - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); page = gfn_to_page(vcpu->kvm, pdpt_gfn); - /* FIXME: !page - emulate? 0xff? */ + if (!page) { + ret = 0; + goto out; + } + pdpt = kmap_atomic(page, KM_USER0); + memcpy(pdpte, pdpt+offset, sizeof(pdpte)); + kunmap_atomic(pdpt, KM_USER0); - ret = 1; - for (i = 0; i < 4; ++i) { - pdpte = pdpt[offset + i]; - if ((pdpte & 1) && (pdpte & 0xfffffff0000001e6ull)) { + for (i = 0; i < ARRAY_SIZE(pdpte); ++i) { + if ((pdpte[i] & 1) && (pdpte[i] & 0xfffffff0000001e6ull)) { ret = 0; goto out; } } + ret = 1; - for (i = 0; i < 4; ++i) - vcpu->pdptrs[i] = pdpt[offset + i]; - + memcpy(vcpu->pdptrs, pdpte, sizeof(vcpu->pdptrs)); out: - kunmap_atomic(pdpt, KM_USER0); - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); return ret; } void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) { - if (cr0 & CR0_RESEVED_BITS) { + if (cr0 & CR0_RESERVED_BITS) { printk(KERN_DEBUG "set_cr0: 0x%lx #GP, reserved bits 0x%lx\n", cr0, vcpu->cr0); inject_gp(vcpu); return; } - if ((cr0 & CR0_NW_MASK) && !(cr0 & CR0_CD_MASK)) { + if ((cr0 & X86_CR0_NW) && !(cr0 & X86_CR0_CD)) { printk(KERN_DEBUG "set_cr0: #GP, CD == 0 && NW == 1\n"); inject_gp(vcpu); return; } - if ((cr0 & CR0_PG_MASK) && !(cr0 & CR0_PE_MASK)) { + if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE)) { printk(KERN_DEBUG "set_cr0: #GP, set PG flag " "and a clear PE flag\n"); inject_gp(vcpu); return; } - if (!is_paging(vcpu) && (cr0 & CR0_PG_MASK)) { + if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) { #ifdef CONFIG_X86_64 if ((vcpu->shadow_efer & EFER_LME)) { int cs_db, cs_l; @@ -499,7 +482,7 @@ #ifdef CONFIG_X86_64 inject_gp(vcpu); return; } - kvm_arch_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); if (cs_l) { printk(KERN_DEBUG "set_cr0: #GP, start paging " "in long mode while CS.L == 1\n"); @@ -518,12 +501,12 @@ #endif } - kvm_arch_ops->set_cr0(vcpu, cr0); + kvm_x86_ops->set_cr0(vcpu, cr0); vcpu->cr0 = cr0; - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); kvm_mmu_reset_context(vcpu); - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); return; } EXPORT_SYMBOL_GPL(set_cr0); @@ -536,62 +519,69 @@ EXPORT_SYMBOL_GPL(lmsw); void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { - if (cr4 & CR4_RESEVED_BITS) { + if (cr4 & CR4_RESERVED_BITS) { printk(KERN_DEBUG "set_cr4: #GP, 
reserved bits\n"); inject_gp(vcpu); return; } if (is_long_mode(vcpu)) { - if (!(cr4 & CR4_PAE_MASK)) { + if (!(cr4 & X86_CR4_PAE)) { printk(KERN_DEBUG "set_cr4: #GP, clearing PAE while " "in long mode\n"); inject_gp(vcpu); return; } - } else if (is_paging(vcpu) && !is_pae(vcpu) && (cr4 & CR4_PAE_MASK) + } else if (is_paging(vcpu) && !is_pae(vcpu) && (cr4 & X86_CR4_PAE) && !load_pdptrs(vcpu, vcpu->cr3)) { printk(KERN_DEBUG "set_cr4: #GP, pdptrs reserved bits\n"); inject_gp(vcpu); + return; } - if (cr4 & CR4_VMXE_MASK) { + if (cr4 & X86_CR4_VMXE) { printk(KERN_DEBUG "set_cr4: #GP, setting VMXE\n"); inject_gp(vcpu); return; } - kvm_arch_ops->set_cr4(vcpu, cr4); - spin_lock(&vcpu->kvm->lock); + kvm_x86_ops->set_cr4(vcpu, cr4); + vcpu->cr4 = cr4; + mutex_lock(&vcpu->kvm->lock); kvm_mmu_reset_context(vcpu); - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); } EXPORT_SYMBOL_GPL(set_cr4); void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) { if (is_long_mode(vcpu)) { - if (cr3 & CR3_L_MODE_RESEVED_BITS) { + if (cr3 & CR3_L_MODE_RESERVED_BITS) { printk(KERN_DEBUG "set_cr3: #GP, reserved bits\n"); inject_gp(vcpu); return; } } else { - if (cr3 & CR3_RESEVED_BITS) { - printk(KERN_DEBUG "set_cr3: #GP, reserved bits\n"); - inject_gp(vcpu); - return; - } - if (is_paging(vcpu) && is_pae(vcpu) && - !load_pdptrs(vcpu, cr3)) { - printk(KERN_DEBUG "set_cr3: #GP, pdptrs " - "reserved bits\n"); - inject_gp(vcpu); - return; + if (is_pae(vcpu)) { + if (cr3 & CR3_PAE_RESERVED_BITS) { + printk(KERN_DEBUG + "set_cr3: #GP, reserved bits\n"); + inject_gp(vcpu); + return; + } + if (is_paging(vcpu) && !load_pdptrs(vcpu, cr3)) { + printk(KERN_DEBUG "set_cr3: #GP, pdptrs " + "reserved bits\n"); + inject_gp(vcpu); + return; + } } + /* + * We don't check reserved bits in nonpae mode, because + * this isn't enforced, and VMware depends on this. + */ } - vcpu->cr3 = cr3; - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); /* * Does the new cr3 value map to physical memory? 
(Note, we * catch an invalid cr3 even in real-mode, because it would @@ -603,46 +593,73 @@ void set_cr3(struct kvm_vcpu *vcpu, unsi */ if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT))) inject_gp(vcpu); - else + else { + vcpu->cr3 = cr3; vcpu->mmu.new_cr3(vcpu); - spin_unlock(&vcpu->kvm->lock); + } + mutex_unlock(&vcpu->kvm->lock); } EXPORT_SYMBOL_GPL(set_cr3); void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8) { - if ( cr8 & CR8_RESEVED_BITS) { + if (cr8 & CR8_RESERVED_BITS) { printk(KERN_DEBUG "set_cr8: #GP, reserved bits 0x%lx\n", cr8); inject_gp(vcpu); return; } - vcpu->cr8 = cr8; + if (irqchip_in_kernel(vcpu->kvm)) + kvm_lapic_set_tpr(vcpu, cr8); + else + vcpu->cr8 = cr8; } EXPORT_SYMBOL_GPL(set_cr8); -void fx_init(struct kvm_vcpu *vcpu) +unsigned long get_cr8(struct kvm_vcpu *vcpu) { - struct __attribute__ ((__packed__)) fx_image_s { - u16 control; //fcw - u16 status; //fsw - u16 tag; // ftw - u16 opcode; //fop - u64 ip; // fpu ip - u64 operand;// fpu dp - u32 mxcsr; - u32 mxcsr_mask; + if (irqchip_in_kernel(vcpu->kvm)) + return kvm_lapic_get_cr8(vcpu); + else + return vcpu->cr8; +} +EXPORT_SYMBOL_GPL(get_cr8); - } *fx_image; +u64 kvm_get_apic_base(struct kvm_vcpu *vcpu) +{ + if (irqchip_in_kernel(vcpu->kvm)) + return vcpu->apic_base; + else + return vcpu->apic_base; +} +EXPORT_SYMBOL_GPL(kvm_get_apic_base); + +void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data) +{ + /* TODO: reserve bits check */ + if (irqchip_in_kernel(vcpu->kvm)) + kvm_lapic_set_base(vcpu, data); + else + vcpu->apic_base = data; +} +EXPORT_SYMBOL_GPL(kvm_set_apic_base); + +void fx_init(struct kvm_vcpu *vcpu) +{ + unsigned after_mxcsr_mask; - fx_save(vcpu->host_fx_image); + /* Initialize guest FPU by resetting ours and saving into guest's */ + preempt_disable(); + fx_save(&vcpu->host_fx_image); fpu_init(); - fx_save(vcpu->guest_fx_image); - fx_restore(vcpu->host_fx_image); + fx_save(&vcpu->guest_fx_image); + fx_restore(&vcpu->host_fx_image); + preempt_enable(); - fx_image = (struct fx_image_s *)vcpu->guest_fx_image; - fx_image->mxcsr = 0x1f80; - memset(vcpu->guest_fx_image + sizeof(struct fx_image_s), - 0, FX_IMAGE_SIZE - sizeof(struct fx_image_s)); + vcpu->cr0 |= X86_CR0_ET; + after_mxcsr_mask = offsetof(struct i387_fxsave_struct, st_space); + vcpu->guest_fx_image.mxcsr = 0x1f80; + memset((void *)&vcpu->guest_fx_image + after_mxcsr_mask, + 0, sizeof(struct i387_fxsave_struct) - after_mxcsr_mask); } EXPORT_SYMBOL_GPL(fx_init); @@ -661,7 +678,6 @@ static int kvm_vm_ioctl_set_memory_regio unsigned long i; struct kvm_memory_slot *memslot; struct kvm_memory_slot old, new; - int memory_config_version; r = -EINVAL; /* General sanity checks */ @@ -681,10 +697,8 @@ static int kvm_vm_ioctl_set_memory_regio if (!npages) mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES; -raced: - spin_lock(&kvm->lock); + mutex_lock(&kvm->lock); - memory_config_version = kvm->memory_config_version; new = old = *memslot; new.base_gfn = base_gfn; @@ -707,11 +721,6 @@ raced: (base_gfn >= s->base_gfn + s->npages))) goto out_unlock; } - /* - * Do memory allocations outside lock. memory_config_version will - * detect any races. 
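With kvm->lock now a mutex, the allocations can simply happen with the lock held: a sleepable lock, unlike the old spinlock, tolerates blocking allocations, so the memory_config_version counter and the raced: retry loop become unnecessary. A toy illustration of the simplified shape, with pthreads standing in for the kernel mutex:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

int update_config(size_t npages, void **out)
{
        void *mem;

        pthread_mutex_lock(&cfg_lock);          /* may sleep: fine */
        mem = calloc(npages, sizeof(void *));   /* blocking allocation */
        if (!mem) {
                pthread_mutex_unlock(&cfg_lock);
                return -1;
        }
        *out = mem;     /* publish while still holding the lock */
        pthread_mutex_unlock(&cfg_lock);
        return 0;
}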
- */ - spin_unlock(&kvm->lock); /* Deallocate if slot is being removed */ if (!npages) @@ -728,14 +737,14 @@ raced: new.phys_mem = vmalloc(npages * sizeof(struct page *)); if (!new.phys_mem) - goto out_free; + goto out_unlock; memset(new.phys_mem, 0, npages * sizeof(struct page *)); for (i = 0; i < npages; ++i) { new.phys_mem[i] = alloc_page(GFP_HIGHUSER | __GFP_ZERO); if (!new.phys_mem[i]) - goto out_free; + goto out_unlock; set_page_private(new.phys_mem[i],0); } } @@ -746,39 +755,25 @@ raced: new.dirty_bitmap = vmalloc(dirty_bytes); if (!new.dirty_bitmap) - goto out_free; + goto out_unlock; memset(new.dirty_bitmap, 0, dirty_bytes); } - spin_lock(&kvm->lock); - - if (memory_config_version != kvm->memory_config_version) { - spin_unlock(&kvm->lock); - kvm_free_physmem_slot(&new, &old); - goto raced; - } - - r = -EAGAIN; - if (kvm->busy) - goto out_unlock; - if (mem->slot >= kvm->nmemslots) kvm->nmemslots = mem->slot + 1; *memslot = new; - ++kvm->memory_config_version; kvm_mmu_slot_remove_write_access(kvm, mem->slot); kvm_flush_remote_tlbs(kvm); - spin_unlock(&kvm->lock); + mutex_unlock(&kvm->lock); kvm_free_physmem_slot(&old, &new); return 0; out_unlock: - spin_unlock(&kvm->lock); -out_free: + mutex_unlock(&kvm->lock); kvm_free_physmem_slot(&new, &old); out: return r; @@ -795,14 +790,8 @@ static int kvm_vm_ioctl_get_dirty_log(st int n; unsigned long any = 0; - spin_lock(&kvm->lock); + mutex_lock(&kvm->lock); - /* - * Prevent changes to guest memory configuration even while the lock - * is not taken. - */ - ++kvm->busy; - spin_unlock(&kvm->lock); r = -EINVAL; if (log->slot >= KVM_MEMORY_SLOTS) goto out; @@ -821,18 +810,17 @@ static int kvm_vm_ioctl_get_dirty_log(st if (copy_to_user(log->dirty_bitmap, memslot->dirty_bitmap, n)) goto out; - spin_lock(&kvm->lock); - kvm_mmu_slot_remove_write_access(kvm, log->slot); - kvm_flush_remote_tlbs(kvm); - memset(memslot->dirty_bitmap, 0, n); - spin_unlock(&kvm->lock); + /* If nothing is dirty, don't bother messing with page tables. 
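The any flag is presumably accumulated by OR-ing the bitmap words together in unchanged context lines outside this hunk; a freestanding sketch of the resulting fast path:

#include <stdint.h>
#include <string.h>

/* Not kernel code: 'words' is the bitmap length in unsigned longs. */
static int consume_dirty_log(unsigned long *bitmap, size_t words,
                             unsigned long *user_copy)
{
        unsigned long any = 0;
        size_t i;

        for (i = 0; i < words; ++i)
                any |= bitmap[i];

        memcpy(user_copy, bitmap, words * sizeof(*bitmap));

        /* Write-protecting pages and flushing TLBs is expensive, so
         * only pay for it when at least one page was dirtied. */
        if (any)
                memset(bitmap, 0, words * sizeof(*bitmap));
        return any != 0;
}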
*/ + if (any) { + kvm_mmu_slot_remove_write_access(kvm, log->slot); + kvm_flush_remote_tlbs(kvm); + memset(memslot->dirty_bitmap, 0, n); + } r = 0; out: - spin_lock(&kvm->lock); - --kvm->busy; - spin_unlock(&kvm->lock); + mutex_unlock(&kvm->lock); return r; } @@ -862,7 +850,7 @@ static int kvm_vm_ioctl_set_memory_alias < alias->target_phys_addr) goto out; - spin_lock(&kvm->lock); + mutex_lock(&kvm->lock); p = &kvm->aliases[alias->slot]; p->base_gfn = alias->guest_phys_addr >> PAGE_SHIFT; @@ -876,7 +864,7 @@ static int kvm_vm_ioctl_set_memory_alias kvm_mmu_zap_all(kvm); - spin_unlock(&kvm->lock); + mutex_unlock(&kvm->lock); return 0; @@ -884,6 +872,63 @@ out: return r; } +static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip) +{ + int r; + + r = 0; + switch (chip->chip_id) { + case KVM_IRQCHIP_PIC_MASTER: + memcpy (&chip->chip.pic, + &pic_irqchip(kvm)->pics[0], + sizeof(struct kvm_pic_state)); + break; + case KVM_IRQCHIP_PIC_SLAVE: + memcpy (&chip->chip.pic, + &pic_irqchip(kvm)->pics[1], + sizeof(struct kvm_pic_state)); + break; + case KVM_IRQCHIP_IOAPIC: + memcpy (&chip->chip.ioapic, + ioapic_irqchip(kvm), + sizeof(struct kvm_ioapic_state)); + break; + default: + r = -EINVAL; + break; + } + return r; +} + +static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip) +{ + int r; + + r = 0; + switch (chip->chip_id) { + case KVM_IRQCHIP_PIC_MASTER: + memcpy (&pic_irqchip(kvm)->pics[0], + &chip->chip.pic, + sizeof(struct kvm_pic_state)); + break; + case KVM_IRQCHIP_PIC_SLAVE: + memcpy (&pic_irqchip(kvm)->pics[1], + &chip->chip.pic, + sizeof(struct kvm_pic_state)); + break; + case KVM_IRQCHIP_IOAPIC: + memcpy (ioapic_irqchip(kvm), + &chip->chip.ioapic, + sizeof(struct kvm_ioapic_state)); + break; + default: + r = -EINVAL; + break; + } + kvm_pic_update_irq(pic_irqchip(kvm)); + return r; +} + static gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn) { int i; @@ -930,37 +975,26 @@ struct page *gfn_to_page(struct kvm *kvm } EXPORT_SYMBOL_GPL(gfn_to_page); +/* WARNING: Does not work on aliased pages. 
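The test_bit-before-set_bit idiom in the rewritten mark_page_dirty skips the locked read-modify-write, and the cache-line ping-pong it costs, whenever the page is already marked, which is the common case for a hot page dirtied repeatedly. The same idiom in portable C11:

#include <stdatomic.h>
#include <limits.h>

#define LONG_BITS (sizeof(unsigned long) * CHAR_BIT)

static void mark_dirty(atomic_ulong *bitmap, unsigned long bit)
{
        atomic_ulong *word = bitmap + bit / LONG_BITS;
        unsigned long mask = 1UL << (bit % LONG_BITS);

        /* Cheap shared read first; atomic RMW only when needed. */
        if (!(atomic_load_explicit(word, memory_order_relaxed) & mask))
                atomic_fetch_or(word, mask);
}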
*/ void mark_page_dirty(struct kvm *kvm, gfn_t gfn) { - int i; struct kvm_memory_slot *memslot; - unsigned long rel_gfn; - for (i = 0; i < kvm->nmemslots; ++i) { - memslot = &kvm->memslots[i]; - - if (gfn >= memslot->base_gfn - && gfn < memslot->base_gfn + memslot->npages) { - - if (!memslot->dirty_bitmap) - return; - - rel_gfn = gfn - memslot->base_gfn; + memslot = __gfn_to_memslot(kvm, gfn); + if (memslot && memslot->dirty_bitmap) { + unsigned long rel_gfn = gfn - memslot->base_gfn; - /* avoid RMW */ - if (!test_bit(rel_gfn, memslot->dirty_bitmap)) - set_bit(rel_gfn, memslot->dirty_bitmap); - return; - } + /* avoid RMW */ + if (!test_bit(rel_gfn, memslot->dirty_bitmap)) + set_bit(rel_gfn, memslot->dirty_bitmap); } } -static int emulator_read_std(unsigned long addr, +int emulator_read_std(unsigned long addr, void *val, unsigned int bytes, - struct x86_emulate_ctxt *ctxt) + struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = ctxt->vcpu; void *data = val; while (bytes) { @@ -990,26 +1024,42 @@ static int emulator_read_std(unsigned lo return X86EMUL_CONTINUE; } +EXPORT_SYMBOL_GPL(emulator_read_std); static int emulator_write_std(unsigned long addr, const void *val, unsigned int bytes, - struct x86_emulate_ctxt *ctxt) + struct kvm_vcpu *vcpu) { - printk(KERN_ERR "emulator_write_std: addr %lx n %d\n", - addr, bytes); + pr_unimpl(vcpu, "emulator_write_std: addr %lx n %d\n", addr, bytes); return X86EMUL_UNHANDLEABLE; } +/* + * Only apic need an MMIO device hook, so shortcut now.. + */ +static struct kvm_io_device *vcpu_find_pervcpu_dev(struct kvm_vcpu *vcpu, + gpa_t addr) +{ + struct kvm_io_device *dev; + + if (vcpu->apic) { + dev = &vcpu->apic->dev; + if (dev->in_range(dev, addr)) + return dev; + } + return NULL; +} + static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu, gpa_t addr) { - /* - * Note that its important to have this wrapper function because - * in the very near future we will be checking for MMIOs against - * the LAPIC as well as the general MMIO bus - */ - return kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr); + struct kvm_io_device *dev; + + dev = vcpu_find_pervcpu_dev(vcpu, addr); + if (dev == NULL) + dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr); + return dev; } static struct kvm_io_device *vcpu_find_pio_dev(struct kvm_vcpu *vcpu, @@ -1021,9 +1071,8 @@ static struct kvm_io_device *vcpu_find_p static int emulator_read_emulated(unsigned long addr, void *val, unsigned int bytes, - struct x86_emulate_ctxt *ctxt) + struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = ctxt->vcpu; struct kvm_io_device *mmio_dev; gpa_t gpa; @@ -1031,7 +1080,7 @@ static int emulator_read_emulated(unsign memcpy(val, vcpu->mmio_data, bytes); vcpu->mmio_read_completed = 0; return X86EMUL_CONTINUE; - } else if (emulator_read_std(addr, val, bytes, ctxt) + } else if (emulator_read_std(addr, val, bytes, vcpu) == X86EMUL_CONTINUE) return X86EMUL_CONTINUE; @@ -1061,7 +1110,6 @@ static int emulator_write_phys(struct kv { struct page *page; void *virt; - unsigned offset = offset_in_page(gpa); if (((gpa + bytes - 1) >> PAGE_SHIFT) != (gpa >> PAGE_SHIFT)) return 0; @@ -1070,7 +1118,7 @@ static int emulator_write_phys(struct kv return 0; mark_page_dirty(vcpu->kvm, gpa >> PAGE_SHIFT); virt = kmap_atomic(page, KM_USER0); - kvm_mmu_pte_write(vcpu, gpa, virt + offset, val, bytes); + kvm_mmu_pte_write(vcpu, gpa, val, bytes); memcpy(virt + offset_in_page(gpa), val, bytes); kunmap_atomic(virt, KM_USER0); return 1; @@ -1079,14 +1127,13 @@ static int emulator_write_phys(struct kv static int 
emulator_write_emulated_onepage(unsigned long addr, const void *val, unsigned int bytes, - struct x86_emulate_ctxt *ctxt) + struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = ctxt->vcpu; struct kvm_io_device *mmio_dev; gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr); if (gpa == UNMAPPED_GVA) { - kvm_arch_ops->inject_page_fault(vcpu, addr, 2); + kvm_x86_ops->inject_page_fault(vcpu, addr, 2); return X86EMUL_PROPAGATE_FAULT; } @@ -1111,31 +1158,32 @@ static int emulator_write_emulated_onepa return X86EMUL_CONTINUE; } -static int emulator_write_emulated(unsigned long addr, +int emulator_write_emulated(unsigned long addr, const void *val, unsigned int bytes, - struct x86_emulate_ctxt *ctxt) + struct kvm_vcpu *vcpu) { /* Crossing a page boundary? */ if (((addr + bytes - 1) ^ addr) & PAGE_MASK) { int rc, now; now = -addr & ~PAGE_MASK; - rc = emulator_write_emulated_onepage(addr, val, now, ctxt); + rc = emulator_write_emulated_onepage(addr, val, now, vcpu); if (rc != X86EMUL_CONTINUE) return rc; addr += now; val += now; bytes -= now; } - return emulator_write_emulated_onepage(addr, val, bytes, ctxt); + return emulator_write_emulated_onepage(addr, val, bytes, vcpu); } +EXPORT_SYMBOL_GPL(emulator_write_emulated); static int emulator_cmpxchg_emulated(unsigned long addr, const void *old, const void *new, unsigned int bytes, - struct x86_emulate_ctxt *ctxt) + struct kvm_vcpu *vcpu) { static int reported; @@ -1143,12 +1191,12 @@ static int emulator_cmpxchg_emulated(uns reported = 1; printk(KERN_WARNING "kvm: emulating exchange as write\n"); } - return emulator_write_emulated(addr, new, bytes, ctxt); + return emulator_write_emulated(addr, new, bytes, vcpu); } static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg) { - return kvm_arch_ops->get_segment_base(vcpu, seg); + return kvm_x86_ops->get_segment_base(vcpu, seg); } int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address) @@ -1158,10 +1206,8 @@ int emulate_invlpg(struct kvm_vcpu *vcpu int emulate_clts(struct kvm_vcpu *vcpu) { - unsigned long cr0; - - cr0 = vcpu->cr0 & ~CR0_TS_MASK; - kvm_arch_ops->set_cr0(vcpu, cr0); + vcpu->cr0 &= ~X86_CR0_TS; + kvm_x86_ops->set_cr0(vcpu, vcpu->cr0); return X86EMUL_CONTINUE; } @@ -1171,11 +1217,10 @@ int emulator_get_dr(struct x86_emulate_c switch (dr) { case 0 ... 3: - *dest = kvm_arch_ops->get_dr(vcpu, dr); + *dest = kvm_x86_ops->get_dr(vcpu, dr); return X86EMUL_CONTINUE; default: - printk(KERN_DEBUG "%s: unexpected dr %u\n", - __FUNCTION__, dr); + pr_unimpl(vcpu, "%s: unexpected dr %u\n", __FUNCTION__, dr); return X86EMUL_UNHANDLEABLE; } } @@ -1185,7 +1230,7 @@ int emulator_set_dr(struct x86_emulate_c unsigned long mask = (ctxt->mode == X86EMUL_MODE_PROT64) ? 
~0ULL : ~0U; int exception; - kvm_arch_ops->set_dr(ctxt->vcpu, dr, value & mask, &exception); + kvm_x86_ops->set_dr(ctxt->vcpu, dr, value & mask, &exception); if (exception) { /* FIXME: better handling */ return X86EMUL_UNHANDLEABLE; @@ -1193,25 +1238,25 @@ int emulator_set_dr(struct x86_emulate_c return X86EMUL_CONTINUE; } -static void report_emulation_failure(struct x86_emulate_ctxt *ctxt) +void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context) { static int reported; u8 opcodes[4]; - unsigned long rip = ctxt->vcpu->rip; + unsigned long rip = vcpu->rip; unsigned long rip_linear; - rip_linear = rip + get_segment_base(ctxt->vcpu, VCPU_SREG_CS); + rip_linear = rip + get_segment_base(vcpu, VCPU_SREG_CS); if (reported) return; - emulator_read_std(rip_linear, (void *)opcodes, 4, ctxt); + emulator_read_std(rip_linear, (void *)opcodes, 4, vcpu); - printk(KERN_ERR "emulation failed but !mmio_needed?" - " rip %lx %02x %02x %02x %02x\n", - rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]); + printk(KERN_ERR "emulation failed (%s) rip %lx %02x %02x %02x %02x\n", + context, rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]); reported = 1; } +EXPORT_SYMBOL_GPL(kvm_report_emulation_failure); struct x86_emulate_ops emulate_ops = { .read_std = emulator_read_std, @@ -1224,44 +1269,66 @@ struct x86_emulate_ops emulate_ops = { int emulate_instruction(struct kvm_vcpu *vcpu, struct kvm_run *run, unsigned long cr2, - u16 error_code) + u16 error_code, + int no_decode) { - struct x86_emulate_ctxt emulate_ctxt; int r; - int cs_db, cs_l; vcpu->mmio_fault_cr2 = cr2; - kvm_arch_ops->cache_regs(vcpu); - - kvm_arch_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); - - emulate_ctxt.vcpu = vcpu; - emulate_ctxt.eflags = kvm_arch_ops->get_rflags(vcpu); - emulate_ctxt.cr2 = cr2; - emulate_ctxt.mode = (emulate_ctxt.eflags & X86_EFLAGS_VM) - ? X86EMUL_MODE_REAL : cs_l - ? X86EMUL_MODE_PROT64 : cs_db - ? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16; - - if (emulate_ctxt.mode == X86EMUL_MODE_PROT64) { - emulate_ctxt.cs_base = 0; - emulate_ctxt.ds_base = 0; - emulate_ctxt.es_base = 0; - emulate_ctxt.ss_base = 0; - } else { - emulate_ctxt.cs_base = get_segment_base(vcpu, VCPU_SREG_CS); - emulate_ctxt.ds_base = get_segment_base(vcpu, VCPU_SREG_DS); - emulate_ctxt.es_base = get_segment_base(vcpu, VCPU_SREG_ES); - emulate_ctxt.ss_base = get_segment_base(vcpu, VCPU_SREG_SS); + kvm_x86_ops->cache_regs(vcpu); + + vcpu->mmio_is_write = 0; + vcpu->pio.string = 0; + + if (!no_decode) { + int cs_db, cs_l; + kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + + vcpu->emulate_ctxt.vcpu = vcpu; + vcpu->emulate_ctxt.eflags = kvm_x86_ops->get_rflags(vcpu); + vcpu->emulate_ctxt.cr2 = cr2; + vcpu->emulate_ctxt.mode = + (vcpu->emulate_ctxt.eflags & X86_EFLAGS_VM) + ? X86EMUL_MODE_REAL : cs_l + ? X86EMUL_MODE_PROT64 : cs_db + ? 
X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16; + + if (vcpu->emulate_ctxt.mode == X86EMUL_MODE_PROT64) { + vcpu->emulate_ctxt.cs_base = 0; + vcpu->emulate_ctxt.ds_base = 0; + vcpu->emulate_ctxt.es_base = 0; + vcpu->emulate_ctxt.ss_base = 0; + } else { + vcpu->emulate_ctxt.cs_base = + get_segment_base(vcpu, VCPU_SREG_CS); + vcpu->emulate_ctxt.ds_base = + get_segment_base(vcpu, VCPU_SREG_DS); + vcpu->emulate_ctxt.es_base = + get_segment_base(vcpu, VCPU_SREG_ES); + vcpu->emulate_ctxt.ss_base = + get_segment_base(vcpu, VCPU_SREG_SS); + } + + vcpu->emulate_ctxt.gs_base = + get_segment_base(vcpu, VCPU_SREG_GS); + vcpu->emulate_ctxt.fs_base = + get_segment_base(vcpu, VCPU_SREG_FS); + + r = x86_decode_insn(&vcpu->emulate_ctxt, &emulate_ops); + if (r) { + if (kvm_mmu_unprotect_page_virt(vcpu, cr2)) + return EMULATE_DONE; + return EMULATE_FAIL; + } } - emulate_ctxt.gs_base = get_segment_base(vcpu, VCPU_SREG_GS); - emulate_ctxt.fs_base = get_segment_base(vcpu, VCPU_SREG_FS); + r = x86_emulate_insn(&vcpu->emulate_ctxt, &emulate_ops); - vcpu->mmio_is_write = 0; - r = x86_emulate_memop(&emulate_ctxt, &emulate_ops); + if (vcpu->pio.string) + return EMULATE_DO_MMIO; if ((r || vcpu->mmio_is_write) && run) { + run->exit_reason = KVM_EXIT_MMIO; run->mmio.phys_addr = vcpu->mmio_phys_addr; memcpy(run->mmio.data, vcpu->mmio_data, 8); run->mmio.len = vcpu->mmio_size; @@ -1272,14 +1339,14 @@ int emulate_instruction(struct kvm_vcpu if (kvm_mmu_unprotect_page_virt(vcpu, cr2)) return EMULATE_DONE; if (!vcpu->mmio_needed) { - report_emulation_failure(&emulate_ctxt); + kvm_report_emulation_failure(vcpu, "mmio"); return EMULATE_FAIL; } return EMULATE_DO_MMIO; } - kvm_arch_ops->decache_regs(vcpu); - kvm_arch_ops->set_rflags(vcpu, emulate_ctxt.eflags); + kvm_x86_ops->decache_regs(vcpu); + kvm_x86_ops->set_rflags(vcpu, vcpu->emulate_ctxt.eflags); if (vcpu->mmio_is_write) { vcpu->mmio_needed = 0; @@ -1290,61 +1357,103 @@ int emulate_instruction(struct kvm_vcpu } EXPORT_SYMBOL_GPL(emulate_instruction); -int kvm_emulate_halt(struct kvm_vcpu *vcpu) +/* + * The vCPU has executed a HLT instruction with in-kernel mode enabled. 
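The blocking loop that follows is the standard wait-queue idiom: register on the queue first, then re-test the wake-up condition on every iteration with the task marked interruptible, so a wake-up that races with the test is never lost. A userspace analogue using a condition variable, for illustration only:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static bool has_interrupt, runnable;

static void block_until_runnable(void)
{
        pthread_mutex_lock(&lock);
        /* Condition re-checked under the lock each time around. */
        while (!has_interrupt && !runnable)
                pthread_cond_wait(&wake, &lock); /* atomic unlock+sleep */
        pthread_mutex_unlock(&lock);
}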
+ */ +static void kvm_vcpu_block(struct kvm_vcpu *vcpu) { - if (vcpu->irq_summary) - return 1; + DECLARE_WAITQUEUE(wait, current); + + add_wait_queue(&vcpu->wq, &wait); + + /* + * We will block until either an interrupt or a signal wakes us up + */ + while (!kvm_cpu_has_interrupt(vcpu) + && !signal_pending(current) + && vcpu->mp_state != VCPU_MP_STATE_RUNNABLE + && vcpu->mp_state != VCPU_MP_STATE_SIPI_RECEIVED) { + set_current_state(TASK_INTERRUPTIBLE); + vcpu_put(vcpu); + schedule(); + vcpu_load(vcpu); + } + + __set_current_state(TASK_RUNNING); + remove_wait_queue(&vcpu->wq, &wait); +} - vcpu->run->exit_reason = KVM_EXIT_HLT; +int kvm_emulate_halt(struct kvm_vcpu *vcpu) +{ ++vcpu->stat.halt_exits; - return 0; + if (irqchip_in_kernel(vcpu->kvm)) { + vcpu->mp_state = VCPU_MP_STATE_HALTED; + kvm_vcpu_block(vcpu); + if (vcpu->mp_state != VCPU_MP_STATE_RUNNABLE) + return -EINTR; + return 1; + } else { + vcpu->run->exit_reason = KVM_EXIT_HLT; + return 0; + } } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; - kvm_arch_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + kvm_x86_ops->cache_regs(vcpu); + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= 0xFFFFFFFF; + a0 &= 0xFFFFFFFF; + a1 &= 0xFFFFFFFF; + a2 &= 0xFFFFFFFF; + a3 &= 0xFFFFFFFF; } + switch (nr) { default: - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_arch_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; - kvm_arch_ops->decache_regs(vcpu); - return 1; + kvm_x86_ops->decache_regs(vcpu); + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* + * Blow out the MMU to ensure that no other VCPU has an active mapping + * to ensure that the updated hypercall appears atomically across all + * VCPUs. 
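The three bytes that patch_hypercall emits are vendor specific, presumably VMCALL (0f 01 c1) on Intel and VMMCALL (0f 01 d9) on AMD, and the write is routed through emulator_write_emulated so ordinary guest memory semantics apply. As an illustration of the encodings only (encode_hypercall is not a kernel symbol):

#include <string.h>

static void encode_hypercall(int is_intel, unsigned char insn[3])
{
        static const unsigned char vmcall[3]  = { 0x0f, 0x01, 0xc1 };
        static const unsigned char vmmcall[3] = { 0x0f, 0x01, 0xd9 };

        memcpy(insn, is_intel ? vmcall : vmmcall, 3);
}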
+ */ + kvm_mmu_zap_all(vcpu->kvm); + + kvm_x86_ops->cache_regs(vcpu); + kvm_x86_ops->patch_hypercall(vcpu, instruction); + if (emulator_write_emulated(vcpu->rip, instruction, 3, vcpu) + != X86EMUL_CONTINUE) + ret = -EFAULT; + + mutex_unlock(&vcpu->kvm->lock); + + return ret; } -EXPORT_SYMBOL_GPL(kvm_hypercall); static u64 mk_cr_64(u64 curr_cr, u32 new_val) { @@ -1355,26 +1464,26 @@ void realmode_lgdt(struct kvm_vcpu *vcpu { struct descriptor_table dt = { limit, base }; - kvm_arch_ops->set_gdt(vcpu, &dt); + kvm_x86_ops->set_gdt(vcpu, &dt); } void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) { struct descriptor_table dt = { limit, base }; - kvm_arch_ops->set_idt(vcpu, &dt); + kvm_x86_ops->set_idt(vcpu, &dt); } void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, unsigned long *rflags) { lmsw(vcpu, msw); - *rflags = kvm_arch_ops->get_rflags(vcpu); + *rflags = kvm_x86_ops->get_rflags(vcpu); } unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr) { - kvm_arch_ops->decache_cr4_guest_bits(vcpu); + kvm_x86_ops->decache_cr4_guest_bits(vcpu); switch (cr) { case 0: return vcpu->cr0; @@ -1396,7 +1505,7 @@ void realmode_set_cr(struct kvm_vcpu *vc switch (cr) { case 0: set_cr0(vcpu, mk_cr_64(vcpu->cr0, val)); - *rflags = kvm_arch_ops->get_rflags(vcpu); + *rflags = kvm_x86_ops->get_rflags(vcpu); break; case 2: vcpu->cr2 = val; @@ -1412,75 +1521,6 @@ void realmode_set_cr(struct kvm_vcpu *vc } } -/* - * Register the para guest with the host: - */ -static int vcpu_register_para(struct kvm_vcpu *vcpu, gpa_t para_state_gpa) -{ - struct kvm_vcpu_para_state *para_state; - hpa_t para_state_hpa, hypercall_hpa; - struct page *para_state_page; - unsigned char *hypercall; - gpa_t hypercall_gpa; - - printk(KERN_DEBUG "kvm: guest trying to enter paravirtual mode\n"); - printk(KERN_DEBUG ".... para_state_gpa: %08Lx\n", para_state_gpa); - - /* - * Needs to be page aligned: - */ - if (para_state_gpa != PAGE_ALIGN(para_state_gpa)) - goto err_gp; - - para_state_hpa = gpa_to_hpa(vcpu, para_state_gpa); - printk(KERN_DEBUG ".... para_state_hpa: %08Lx\n", para_state_hpa); - if (is_error_hpa(para_state_hpa)) - goto err_gp; - - mark_page_dirty(vcpu->kvm, para_state_gpa >> PAGE_SHIFT); - para_state_page = pfn_to_page(para_state_hpa >> PAGE_SHIFT); - para_state = kmap_atomic(para_state_page, KM_USER0); - - printk(KERN_DEBUG ".... guest version: %d\n", para_state->guest_version); - printk(KERN_DEBUG ".... size: %d\n", para_state->size); - - para_state->host_version = KVM_PARA_API_VERSION; - /* - * We cannot support guests that try to register themselves - * with a newer API version than the host supports: - */ - if (para_state->guest_version > KVM_PARA_API_VERSION) { - para_state->ret = -KVM_EINVAL; - goto err_kunmap_skip; - } - - hypercall_gpa = para_state->hypercall_gpa; - hypercall_hpa = gpa_to_hpa(vcpu, hypercall_gpa); - printk(KERN_DEBUG ".... 
hypercall_hpa: %08Lx\n", hypercall_hpa); - if (is_error_hpa(hypercall_hpa)) { - para_state->ret = -KVM_EINVAL; - goto err_kunmap_skip; - } - - printk(KERN_DEBUG "kvm: para guest successfully registered.\n"); - vcpu->para_state_page = para_state_page; - vcpu->para_state_gpa = para_state_gpa; - vcpu->hypercall_gpa = hypercall_gpa; - - mark_page_dirty(vcpu->kvm, hypercall_gpa >> PAGE_SHIFT); - hypercall = kmap_atomic(pfn_to_page(hypercall_hpa >> PAGE_SHIFT), - KM_USER1) + (hypercall_hpa & ~PAGE_MASK); - kvm_arch_ops->patch_hypercall(vcpu, hypercall); - kunmap_atomic(hypercall, KM_USER1); - - para_state->ret = 0; -err_kunmap_skip: - kunmap_atomic(para_state, KM_USER0); - return 0; -err_gp: - return 1; -} - int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) { u64 data; @@ -1511,7 +1551,7 @@ int kvm_get_msr_common(struct kvm_vcpu * data = 3; break; case MSR_IA32_APICBASE: - data = vcpu->apic_base; + data = kvm_get_apic_base(vcpu); break; case MSR_IA32_MISC_ENABLE: data = vcpu->ia32_misc_enable_msr; @@ -1522,7 +1562,7 @@ #ifdef CONFIG_X86_64 break; #endif default: - printk(KERN_ERR "kvm: unhandled rdmsr: 0x%x\n", msr); + pr_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", msr); return 1; } *pdata = data; @@ -1537,7 +1577,7 @@ EXPORT_SYMBOL_GPL(kvm_get_msr_common); */ int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) { - return kvm_arch_ops->get_msr(vcpu, msr_index, pdata); + return kvm_x86_ops->get_msr(vcpu, msr_index, pdata); } #ifdef CONFIG_X86_64 @@ -1558,7 +1598,7 @@ static void set_efer(struct kvm_vcpu *vc return; } - kvm_arch_ops->set_efer(vcpu, efer); + kvm_x86_ops->set_efer(vcpu, efer); efer &= ~EFER_LMA; efer |= vcpu->shadow_efer & EFER_LMA; @@ -1577,11 +1617,11 @@ #ifdef CONFIG_X86_64 break; #endif case MSR_IA32_MC0_STATUS: - printk(KERN_WARNING "%s: MSR_IA32_MC0_STATUS 0x%llx, nop\n", + pr_unimpl(vcpu, "%s: MSR_IA32_MC0_STATUS 0x%llx, nop\n", __FUNCTION__, data); break; case MSR_IA32_MCG_STATUS: - printk(KERN_WARNING "%s: MSR_IA32_MCG_STATUS 0x%llx, nop\n", + pr_unimpl(vcpu, "%s: MSR_IA32_MCG_STATUS 0x%llx, nop\n", __FUNCTION__, data); break; case MSR_IA32_UCODE_REV: @@ -1589,19 +1629,13 @@ #endif case 0x200 ... 
0x2ff: /* MTRRs */ break; case MSR_IA32_APICBASE: - vcpu->apic_base = data; + kvm_set_apic_base(vcpu, data); break; case MSR_IA32_MISC_ENABLE: vcpu->ia32_misc_enable_msr = data; break; - /* - * This is the 'probe whether the host is KVM' logic: - */ - case MSR_KVM_API_MAGIC: - return vcpu_register_para(vcpu, data); - default: - printk(KERN_ERR "kvm: unhandled wrmsr: 0x%x\n", msr); + pr_unimpl(vcpu, "unhandled wrmsr: 0x%x\n", msr); return 1; } return 0; @@ -1615,44 +1649,24 @@ EXPORT_SYMBOL_GPL(kvm_set_msr_common); */ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data) { - return kvm_arch_ops->set_msr(vcpu, msr_index, data); + return kvm_x86_ops->set_msr(vcpu, msr_index, data); } void kvm_resched(struct kvm_vcpu *vcpu) { if (!need_resched()) return; - vcpu_put(vcpu); cond_resched(); - vcpu_load(vcpu); } EXPORT_SYMBOL_GPL(kvm_resched); -void load_msrs(struct vmx_msr_entry *e, int n) -{ - int i; - - for (i = 0; i < n; ++i) - wrmsrl(e[i].index, e[i].data); -} -EXPORT_SYMBOL_GPL(load_msrs); - -void save_msrs(struct vmx_msr_entry *e, int n) -{ - int i; - - for (i = 0; i < n; ++i) - rdmsrl(e[i].index, e[i].data); -} -EXPORT_SYMBOL_GPL(save_msrs); - void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { int i; u32 function; struct kvm_cpuid_entry *e, *best; - kvm_arch_ops->cache_regs(vcpu); + kvm_x86_ops->cache_regs(vcpu); function = vcpu->regs[VCPU_REGS_RAX]; vcpu->regs[VCPU_REGS_RAX] = 0; vcpu->regs[VCPU_REGS_RBX] = 0; @@ -1678,8 +1692,8 @@ void kvm_emulate_cpuid(struct kvm_vcpu * vcpu->regs[VCPU_REGS_RCX] = best->ecx; vcpu->regs[VCPU_REGS_RDX] = best->edx; } - kvm_arch_ops->decache_regs(vcpu); - kvm_arch_ops->skip_emulated_instruction(vcpu); + kvm_x86_ops->decache_regs(vcpu); + kvm_x86_ops->skip_emulated_instruction(vcpu); } EXPORT_SYMBOL_GPL(kvm_emulate_cpuid); @@ -1690,11 +1704,9 @@ static int pio_copy_data(struct kvm_vcpu unsigned bytes; int nr_pages = vcpu->pio.guest_pages[1] ? 
2 : 1; - kvm_arch_ops->vcpu_put(vcpu); q = vmap(vcpu->pio.guest_pages, nr_pages, VM_READ|VM_WRITE, PAGE_KERNEL); if (!q) { - kvm_arch_ops->vcpu_load(vcpu); free_pio_guest_pages(vcpu); return -ENOMEM; } @@ -1706,7 +1718,6 @@ static int pio_copy_data(struct kvm_vcpu memcpy(p, q, bytes); q -= vcpu->pio.guest_page_offset; vunmap(q); - kvm_arch_ops->vcpu_load(vcpu); free_pio_guest_pages(vcpu); return 0; } @@ -1717,7 +1728,7 @@ static int complete_pio(struct kvm_vcpu long delta; int r; - kvm_arch_ops->cache_regs(vcpu); + kvm_x86_ops->cache_regs(vcpu); if (!io->string) { if (io->in) @@ -1727,7 +1738,7 @@ static int complete_pio(struct kvm_vcpu if (io->in) { r = pio_copy_data(vcpu); if (r) { - kvm_arch_ops->cache_regs(vcpu); + kvm_x86_ops->cache_regs(vcpu); return r; } } @@ -1750,79 +1761,109 @@ static int complete_pio(struct kvm_vcpu vcpu->regs[VCPU_REGS_RSI] += delta; } - kvm_arch_ops->decache_regs(vcpu); + kvm_x86_ops->decache_regs(vcpu); io->count -= io->cur_count; io->cur_count = 0; - if (!io->count) - kvm_arch_ops->skip_emulated_instruction(vcpu); return 0; } -void kernel_pio(struct kvm_io_device *pio_dev, struct kvm_vcpu *vcpu) +static void kernel_pio(struct kvm_io_device *pio_dev, + struct kvm_vcpu *vcpu, + void *pd) { /* TODO: String I/O for in kernel device */ + mutex_lock(&vcpu->kvm->lock); if (vcpu->pio.in) kvm_iodevice_read(pio_dev, vcpu->pio.port, vcpu->pio.size, - vcpu->pio_data); + pd); else kvm_iodevice_write(pio_dev, vcpu->pio.port, vcpu->pio.size, - vcpu->pio_data); + pd); + mutex_unlock(&vcpu->kvm->lock); } -int kvm_setup_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, - int size, unsigned long count, int string, int down, +static void pio_string_write(struct kvm_io_device *pio_dev, + struct kvm_vcpu *vcpu) +{ + struct kvm_pio_request *io = &vcpu->pio; + void *pd = vcpu->pio_data; + int i; + + mutex_lock(&vcpu->kvm->lock); + for (i = 0; i < io->cur_count; i++) { + kvm_iodevice_write(pio_dev, io->port, + io->size, + pd); + pd += io->size; + } + mutex_unlock(&vcpu->kvm->lock); +} + +int kvm_emulate_pio (struct kvm_vcpu *vcpu, struct kvm_run *run, int in, + int size, unsigned port) +{ + struct kvm_io_device *pio_dev; + + vcpu->run->exit_reason = KVM_EXIT_IO; + vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT; + vcpu->run->io.size = vcpu->pio.size = size; + vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE; + vcpu->run->io.count = vcpu->pio.count = vcpu->pio.cur_count = 1; + vcpu->run->io.port = vcpu->pio.port = port; + vcpu->pio.in = in; + vcpu->pio.string = 0; + vcpu->pio.down = 0; + vcpu->pio.guest_page_offset = 0; + vcpu->pio.rep = 0; + + kvm_x86_ops->cache_regs(vcpu); + memcpy(vcpu->pio_data, &vcpu->regs[VCPU_REGS_RAX], 4); + kvm_x86_ops->decache_regs(vcpu); + + kvm_x86_ops->skip_emulated_instruction(vcpu); + + pio_dev = vcpu_find_pio_dev(vcpu, port); + if (pio_dev) { + kernel_pio(pio_dev, vcpu, vcpu->pio_data); + complete_pio(vcpu); + return 1; + } + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_pio); + +int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, + int size, unsigned long count, int down, gva_t address, int rep, unsigned port) { unsigned now, in_page; - int i; + int i, ret = 0; int nr_pages = 1; struct page *page; struct kvm_io_device *pio_dev; vcpu->run->exit_reason = KVM_EXIT_IO; vcpu->run->io.direction = in ? 
KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT; - vcpu->run->io.size = size; + vcpu->run->io.size = vcpu->pio.size = size; vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE; - vcpu->run->io.count = count; - vcpu->run->io.port = port; - vcpu->pio.count = count; - vcpu->pio.cur_count = count; - vcpu->pio.size = size; + vcpu->run->io.count = vcpu->pio.count = vcpu->pio.cur_count = count; + vcpu->run->io.port = vcpu->pio.port = port; vcpu->pio.in = in; - vcpu->pio.port = port; - vcpu->pio.string = string; + vcpu->pio.string = 1; vcpu->pio.down = down; vcpu->pio.guest_page_offset = offset_in_page(address); vcpu->pio.rep = rep; - pio_dev = vcpu_find_pio_dev(vcpu, port); - if (!string) { - kvm_arch_ops->cache_regs(vcpu); - memcpy(vcpu->pio_data, &vcpu->regs[VCPU_REGS_RAX], 4); - kvm_arch_ops->decache_regs(vcpu); - if (pio_dev) { - kernel_pio(pio_dev, vcpu); - complete_pio(vcpu); - return 1; - } - return 0; - } - /* TODO: String I/O for in kernel device */ - if (pio_dev) - printk(KERN_ERR "kvm_setup_pio: no string io support\n"); - if (!count) { - kvm_arch_ops->skip_emulated_instruction(vcpu); + kvm_x86_ops->skip_emulated_instruction(vcpu); return 1; } - now = min(count, PAGE_SIZE / size); - if (!down) in_page = PAGE_SIZE - offset_in_page(address); else @@ -1841,20 +1882,23 @@ int kvm_setup_pio(struct kvm_vcpu *vcpu, /* * String I/O in reverse. Yuck. Kill the guest, fix later. */ - printk(KERN_ERR "kvm: guest string pio down\n"); + pr_unimpl(vcpu, "guest string pio down\n"); inject_gp(vcpu); return 1; } vcpu->run->io.count = now; vcpu->pio.cur_count = now; + if (vcpu->pio.cur_count == vcpu->pio.count) + kvm_x86_ops->skip_emulated_instruction(vcpu); + for (i = 0; i < nr_pages; ++i) { - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); page = gva_to_page(vcpu, address + i * PAGE_SIZE); if (page) get_page(page); vcpu->pio.guest_pages[i] = page; - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); if (!page) { inject_gp(vcpu); free_pio_guest_pages(vcpu); @@ -1862,11 +1906,145 @@ int kvm_setup_pio(struct kvm_vcpu *vcpu, } } - if (!vcpu->pio.in) - return pio_copy_data(vcpu); - return 0; + pio_dev = vcpu_find_pio_dev(vcpu, port); + if (!vcpu->pio.in) { + /* string PIO write */ + ret = pio_copy_data(vcpu); + if (ret >= 0 && pio_dev) { + pio_string_write(pio_dev, vcpu); + complete_pio(vcpu); + if (vcpu->pio.count == 0) + ret = 1; + } + } else if (pio_dev) + pr_unimpl(vcpu, "no string pio read support yet, " + "port %x size %d count %ld\n", + port, size, count); + + return ret; +} +EXPORT_SYMBOL_GPL(kvm_emulate_pio_string); + +/* + * Check if userspace requested an interrupt window, and that the + * interrupt window is open. + * + * No need to exit to userspace if we already have an interrupt queued. 
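Spelled out as a freestanding predicate (the field names here are stand-ins for the vcpu and kvm_run fields used below): exit to userspace only when userspace asked for the window, the guest can actually take an interrupt right now, and nothing is already queued in the kernel.

#include <stdbool.h>
#include <stdint.h>

struct vcpu_state {
        bool irq_already_queued;     /* vcpu->irq_summary != 0 */
        bool userspace_wants_window; /* kvm_run->request_interrupt_window */
        bool window_open;            /* guest is interruptible here */
        uint64_t rflags;
};

static bool should_exit_for_irq_window(const struct vcpu_state *s)
{
        return !s->irq_already_queued &&
               s->userspace_wants_window &&
               s->window_open &&
               (s->rflags & (1ULL << 9)); /* X86_EFLAGS_IF is bit 9 */
}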
+ */ +static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) +{ + return (!vcpu->irq_summary && + kvm_run->request_interrupt_window && + vcpu->interrupt_window_open && + (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_IF)); +} + +static void post_kvm_run_save(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) +{ + kvm_run->if_flag = (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_IF) != 0; + kvm_run->cr8 = get_cr8(vcpu); + kvm_run->apic_base = kvm_get_apic_base(vcpu); + if (irqchip_in_kernel(vcpu->kvm)) + kvm_run->ready_for_interrupt_injection = 1; + else + kvm_run->ready_for_interrupt_injection = + (vcpu->interrupt_window_open && + vcpu->irq_summary == 0); } -EXPORT_SYMBOL_GPL(kvm_setup_pio); + +static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + int r; + + if (unlikely(vcpu->mp_state == VCPU_MP_STATE_SIPI_RECEIVED)) { + printk("vcpu %d received sipi with vector # %x\n", + vcpu->vcpu_id, vcpu->sipi_vector); + kvm_lapic_reset(vcpu); + kvm_x86_ops->vcpu_reset(vcpu); + vcpu->mp_state = VCPU_MP_STATE_RUNNABLE; + } + +preempted: + if (vcpu->guest_debug.enabled) + kvm_x86_ops->guest_debug_pre(vcpu); + +again: + r = kvm_mmu_reload(vcpu); + if (unlikely(r)) + goto out; + + preempt_disable(); + + kvm_x86_ops->prepare_guest_switch(vcpu); + kvm_load_guest_fpu(vcpu); + + local_irq_disable(); + + if (signal_pending(current)) { + local_irq_enable(); + preempt_enable(); + r = -EINTR; + kvm_run->exit_reason = KVM_EXIT_INTR; + ++vcpu->stat.signal_exits; + goto out; + } + + if (irqchip_in_kernel(vcpu->kvm)) + kvm_x86_ops->inject_pending_irq(vcpu); + else if (!vcpu->mmio_read_completed) + kvm_x86_ops->inject_pending_vectors(vcpu, kvm_run); + + vcpu->guest_mode = 1; + + if (vcpu->requests) + if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) + kvm_x86_ops->tlb_flush(vcpu); + + kvm_x86_ops->run(vcpu, kvm_run); + + vcpu->guest_mode = 0; + local_irq_enable(); + + ++vcpu->stat.exits; + + preempt_enable(); + + /* + * Profile KVM exit RIPs: + */ + if (unlikely(prof_on == KVM_PROFILING)) { + kvm_x86_ops->cache_regs(vcpu); + profile_hit(KVM_PROFILING, (void *)vcpu->rip); + } + + r = kvm_x86_ops->handle_exit(kvm_run, vcpu); + + if (r > 0) { + if (dm_request_for_irq_injection(vcpu, kvm_run)) { + r = -EINTR; + kvm_run->exit_reason = KVM_EXIT_INTR; + ++vcpu->stat.request_irq_exits; + goto out; + } + if (!need_resched()) { + ++vcpu->stat.light_exits; + goto again; + } + } + +out: + if (r > 0) { + kvm_resched(vcpu); + goto preempted; + } + + post_kvm_run_save(vcpu, kvm_run); + + return r; +} + static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { @@ -1875,11 +2053,18 @@ static int kvm_vcpu_ioctl_run(struct kvm vcpu_load(vcpu); + if (unlikely(vcpu->mp_state == VCPU_MP_STATE_UNINITIALIZED)) { + kvm_vcpu_block(vcpu); + vcpu_put(vcpu); + return -EAGAIN; + } + if (vcpu->sigset_active) sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved); /* re-sync apic's tpr */ - vcpu->cr8 = kvm_run->cr8; + if (!irqchip_in_kernel(vcpu->kvm)) + set_cr8(vcpu, kvm_run->cr8); if (vcpu->pio.cur_count) { r = complete_pio(vcpu); @@ -1892,24 +2077,23 @@ static int kvm_vcpu_ioctl_run(struct kvm vcpu->mmio_read_completed = 1; vcpu->mmio_needed = 0; r = emulate_instruction(vcpu, kvm_run, - vcpu->mmio_fault_cr2, 0); + vcpu->mmio_fault_cr2, 0, 1); if (r == EMULATE_DO_MMIO) { /* * Read-modify-write. Back to userspace. 
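On the userspace side this pairs with the KVM_RUN loop: on KVM_EXIT_MMIO the VMM services the access and, for reads, leaves the bytes in run->mmio.data for the kernel to complete on re-entry. A sketch using the real ioctl and struct kvm_run fields, with device_read and device_write as hypothetical device-model hooks:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical device-model hooks, not part of any real API. */
extern void device_write(uint64_t addr, const void *data, uint32_t len);
extern void device_read(uint64_t addr, void *data, uint32_t len);

static void run_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);
                if (run->exit_reason != KVM_EXIT_MMIO)
                        break;
                if (run->mmio.is_write)
                        device_write(run->mmio.phys_addr,
                                     run->mmio.data, run->mmio.len);
                else
                        /* Fill run->mmio.data; the kernel copies it
                         * back in on the next KVM_RUN, completing the
                         * read-modify-write described above. */
                        device_read(run->mmio.phys_addr,
                                    run->mmio.data, run->mmio.len);
        }
}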
*/ - kvm_run->exit_reason = KVM_EXIT_MMIO; r = 0; goto out; } } if (kvm_run->exit_reason == KVM_EXIT_HYPERCALL) { - kvm_arch_ops->cache_regs(vcpu); + kvm_x86_ops->cache_regs(vcpu); vcpu->regs[VCPU_REGS_RAX] = kvm_run->hypercall.ret; - kvm_arch_ops->decache_regs(vcpu); + kvm_x86_ops->decache_regs(vcpu); } - r = kvm_arch_ops->run(vcpu, kvm_run); + r = __vcpu_run(vcpu, kvm_run); out: if (vcpu->sigset_active) @@ -1924,7 +2108,7 @@ static int kvm_vcpu_ioctl_get_regs(struc { vcpu_load(vcpu); - kvm_arch_ops->cache_regs(vcpu); + kvm_x86_ops->cache_regs(vcpu); regs->rax = vcpu->regs[VCPU_REGS_RAX]; regs->rbx = vcpu->regs[VCPU_REGS_RBX]; @@ -1946,7 +2130,7 @@ #ifdef CONFIG_X86_64 #endif regs->rip = vcpu->rip; - regs->rflags = kvm_arch_ops->get_rflags(vcpu); + regs->rflags = kvm_x86_ops->get_rflags(vcpu); /* * Don't leak debug flags in case they were set for guest debugging @@ -1984,9 +2168,9 @@ #ifdef CONFIG_X86_64 #endif vcpu->rip = regs->rip; - kvm_arch_ops->set_rflags(vcpu, regs->rflags); + kvm_x86_ops->set_rflags(vcpu, regs->rflags); - kvm_arch_ops->decache_regs(vcpu); + kvm_x86_ops->decache_regs(vcpu); vcpu_put(vcpu); @@ -1996,13 +2180,14 @@ #endif static void get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) { - return kvm_arch_ops->get_segment(vcpu, var, seg); + return kvm_x86_ops->get_segment(vcpu, var, seg); } static int kvm_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) { struct descriptor_table dt; + int pending_vec; vcpu_load(vcpu); @@ -2016,24 +2201,31 @@ static int kvm_vcpu_ioctl_get_sregs(stru get_segment(vcpu, &sregs->tr, VCPU_SREG_TR); get_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR); - kvm_arch_ops->get_idt(vcpu, &dt); + kvm_x86_ops->get_idt(vcpu, &dt); sregs->idt.limit = dt.limit; sregs->idt.base = dt.base; - kvm_arch_ops->get_gdt(vcpu, &dt); + kvm_x86_ops->get_gdt(vcpu, &dt); sregs->gdt.limit = dt.limit; sregs->gdt.base = dt.base; - kvm_arch_ops->decache_cr4_guest_bits(vcpu); + kvm_x86_ops->decache_cr4_guest_bits(vcpu); sregs->cr0 = vcpu->cr0; sregs->cr2 = vcpu->cr2; sregs->cr3 = vcpu->cr3; sregs->cr4 = vcpu->cr4; - sregs->cr8 = vcpu->cr8; + sregs->cr8 = get_cr8(vcpu); sregs->efer = vcpu->shadow_efer; - sregs->apic_base = vcpu->apic_base; - - memcpy(sregs->interrupt_bitmap, vcpu->irq_pending, - sizeof sregs->interrupt_bitmap); + sregs->apic_base = kvm_get_apic_base(vcpu); + + if (irqchip_in_kernel(vcpu->kvm)) { + memset(sregs->interrupt_bitmap, 0, + sizeof sregs->interrupt_bitmap); + pending_vec = kvm_x86_ops->get_irq(vcpu); + if (pending_vec >= 0) + set_bit(pending_vec, (unsigned long *)sregs->interrupt_bitmap); + } else + memcpy(sregs->interrupt_bitmap, vcpu->irq_pending, + sizeof sregs->interrupt_bitmap); vcpu_put(vcpu); @@ -2043,56 +2235,69 @@ static int kvm_vcpu_ioctl_get_sregs(stru static void set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) { - return kvm_arch_ops->set_segment(vcpu, var, seg); + return kvm_x86_ops->set_segment(vcpu, var, seg); } static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) { int mmu_reset_needed = 0; - int i; + int i, pending_vec, max_bits; struct descriptor_table dt; vcpu_load(vcpu); dt.limit = sregs->idt.limit; dt.base = sregs->idt.base; - kvm_arch_ops->set_idt(vcpu, &dt); + kvm_x86_ops->set_idt(vcpu, &dt); dt.limit = sregs->gdt.limit; dt.base = sregs->gdt.base; - kvm_arch_ops->set_gdt(vcpu, &dt); + kvm_x86_ops->set_gdt(vcpu, &dt); vcpu->cr2 = sregs->cr2; mmu_reset_needed |= vcpu->cr3 != sregs->cr3; vcpu->cr3 = sregs->cr3; - vcpu->cr8 = sregs->cr8; + 
set_cr8(vcpu, sregs->cr8); mmu_reset_needed |= vcpu->shadow_efer != sregs->efer; #ifdef CONFIG_X86_64 - kvm_arch_ops->set_efer(vcpu, sregs->efer); + kvm_x86_ops->set_efer(vcpu, sregs->efer); #endif - vcpu->apic_base = sregs->apic_base; + kvm_set_apic_base(vcpu, sregs->apic_base); - kvm_arch_ops->decache_cr4_guest_bits(vcpu); + kvm_x86_ops->decache_cr4_guest_bits(vcpu); mmu_reset_needed |= vcpu->cr0 != sregs->cr0; - kvm_arch_ops->set_cr0(vcpu, sregs->cr0); + vcpu->cr0 = sregs->cr0; + kvm_x86_ops->set_cr0(vcpu, sregs->cr0); mmu_reset_needed |= vcpu->cr4 != sregs->cr4; - kvm_arch_ops->set_cr4(vcpu, sregs->cr4); + kvm_x86_ops->set_cr4(vcpu, sregs->cr4); if (!is_long_mode(vcpu) && is_pae(vcpu)) load_pdptrs(vcpu, vcpu->cr3); if (mmu_reset_needed) kvm_mmu_reset_context(vcpu); - memcpy(vcpu->irq_pending, sregs->interrupt_bitmap, - sizeof vcpu->irq_pending); - vcpu->irq_summary = 0; - for (i = 0; i < NR_IRQ_WORDS; ++i) - if (vcpu->irq_pending[i]) - __set_bit(i, &vcpu->irq_summary); + if (!irqchip_in_kernel(vcpu->kvm)) { + memcpy(vcpu->irq_pending, sregs->interrupt_bitmap, + sizeof vcpu->irq_pending); + vcpu->irq_summary = 0; + for (i = 0; i < ARRAY_SIZE(vcpu->irq_pending); ++i) + if (vcpu->irq_pending[i]) + __set_bit(i, &vcpu->irq_summary); + } else { + max_bits = (sizeof sregs->interrupt_bitmap) << 3; + pending_vec = find_first_bit( + (const unsigned long *)sregs->interrupt_bitmap, + max_bits); + /* Only pending external irq is handled here */ + if (pending_vec < max_bits) { + kvm_x86_ops->set_irq(vcpu, pending_vec); + printk("Set back pending irq %d\n", pending_vec); + } + } set_segment(vcpu, &sregs->cs, VCPU_SREG_CS); set_segment(vcpu, &sregs->ds, VCPU_SREG_DS); @@ -2109,6 +2314,16 @@ #endif return 0; } +void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) +{ + struct kvm_segment cs; + + get_segment(vcpu, &cs, VCPU_SREG_CS); + *db = cs.db; + *l = cs.l; +} +EXPORT_SYMBOL_GPL(kvm_get_cs_db_l_bits); + /* * List of msr numbers which we expose to userspace through KVM_GET_MSRS * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST. 
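In the in-kernel-irqchip branch of set_sregs above, at most one pending external vector is recovered from the bitmap via find_first_bit, which returns max_bits when no bit is set. A freestanding equivalent of that scan:

#include <limits.h>

#define LONG_BITS (sizeof(unsigned long) * CHAR_BIT)

static unsigned find_first_set(const unsigned long *map, unsigned max_bits)
{
        unsigned i;

        for (i = 0; i < max_bits; ++i)
                if (map[i / LONG_BITS] & (1UL << (i % LONG_BITS)))
                        return i;
        return max_bits; /* nothing pending, as find_first_bit does */
}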
@@ -2236,13 +2451,13 @@ static int kvm_vcpu_ioctl_translate(stru gpa_t gpa; vcpu_load(vcpu); - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); gpa = vcpu->mmu.gva_to_gpa(vcpu, vaddr); tr->physical_address = gpa; tr->valid = gpa != UNMAPPED_GVA; tr->writeable = 1; tr->usermode = 0; - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); vcpu_put(vcpu); return 0; @@ -2253,6 +2468,8 @@ static int kvm_vcpu_ioctl_interrupt(stru { if (irq->irq < 0 || irq->irq >= 256) return -EINVAL; + if (irqchip_in_kernel(vcpu->kvm)) + return -ENXIO; vcpu_load(vcpu); set_bit(irq->irq, vcpu->irq_pending); @@ -2270,7 +2487,7 @@ static int kvm_vcpu_ioctl_debug_guest(st vcpu_load(vcpu); - r = kvm_arch_ops->set_guest_debug(vcpu, dbg); + r = kvm_x86_ops->set_guest_debug(vcpu, dbg); vcpu_put(vcpu); @@ -2285,7 +2502,6 @@ static struct page *kvm_vcpu_nopage(stru unsigned long pgoff; struct page *page; - *type = VM_FAULT_MINOR; pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; if (pgoff == 0) page = virt_to_page(vcpu->run); @@ -2294,6 +2510,9 @@ static struct page *kvm_vcpu_nopage(stru else return NOPAGE_SIGBUS; get_page(page); + if (type != NULL) + *type = VM_FAULT_MINOR; + return page; } @@ -2346,74 +2565,52 @@ static int kvm_vm_ioctl_create_vcpu(stru { int r; struct kvm_vcpu *vcpu; - struct page *page; - r = -EINVAL; if (!valid_vcpu(n)) - goto out; - - vcpu = &kvm->vcpus[n]; - - mutex_lock(&vcpu->mutex); - - if (vcpu->vmcs) { - mutex_unlock(&vcpu->mutex); - return -EEXIST; - } - - page = alloc_page(GFP_KERNEL | __GFP_ZERO); - r = -ENOMEM; - if (!page) - goto out_unlock; - vcpu->run = page_address(page); - - page = alloc_page(GFP_KERNEL | __GFP_ZERO); - r = -ENOMEM; - if (!page) - goto out_free_run; - vcpu->pio_data = page_address(page); + return -EINVAL; - vcpu->host_fx_image = (char*)ALIGN((hva_t)vcpu->fx_buf, - FX_IMAGE_ALIGN); - vcpu->guest_fx_image = vcpu->host_fx_image + FX_IMAGE_SIZE; - vcpu->cr0 = 0x10; + vcpu = kvm_x86_ops->vcpu_create(kvm, n); + if (IS_ERR(vcpu)) + return PTR_ERR(vcpu); - r = kvm_arch_ops->vcpu_create(vcpu); - if (r < 0) - goto out_free_vcpus; + preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops); - r = kvm_mmu_create(vcpu); - if (r < 0) - goto out_free_vcpus; + /* We do fxsave: this must be aligned. 
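fxsave and fxrstor take a 512-byte operand that must be 16-byte aligned or they fault, hence the BUG_ON below. The alignment can also be carried in the type itself, as in this sketch (fx_image is a stand-in member name, not the kernel's):

#include <assert.h>

struct vcpu_fx {
        unsigned char fx_image[512] __attribute__((aligned(16)));
};

static void check_alignment(const struct vcpu_fx *v)
{
        /* Mirrors the BUG_ON: low four address bits must be clear. */
        assert(((unsigned long)v->fx_image & 0xF) == 0);
}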
*/ + BUG_ON((unsigned long)&vcpu->host_fx_image & 0xF); - kvm_arch_ops->vcpu_load(vcpu); + vcpu_load(vcpu); r = kvm_mmu_setup(vcpu); - if (r >= 0) - r = kvm_arch_ops->vcpu_setup(vcpu); vcpu_put(vcpu); - if (r < 0) - goto out_free_vcpus; + goto free_vcpu; + mutex_lock(&kvm->lock); + if (kvm->vcpus[n]) { + r = -EEXIST; + mutex_unlock(&kvm->lock); + goto mmu_unload; + } + kvm->vcpus[n] = vcpu; + mutex_unlock(&kvm->lock); + + /* Now it's all set up, let userspace reach it */ r = create_vcpu_fd(vcpu); if (r < 0) - goto out_free_vcpus; + goto unlink; + return r; - spin_lock(&kvm_lock); - if (n >= kvm->nvcpus) - kvm->nvcpus = n + 1; - spin_unlock(&kvm_lock); +unlink: + mutex_lock(&kvm->lock); + kvm->vcpus[n] = NULL; + mutex_unlock(&kvm->lock); - return r; +mmu_unload: + vcpu_load(vcpu); + kvm_mmu_unload(vcpu); + vcpu_put(vcpu); -out_free_vcpus: - kvm_free_vcpu(vcpu); -out_free_run: - free_page((unsigned long)vcpu->run); - vcpu->run = NULL; -out_unlock: - mutex_unlock(&vcpu->mutex); -out: +free_vcpu: + kvm_x86_ops->vcpu_free(vcpu); return r; } @@ -2493,7 +2690,7 @@ #endif static int kvm_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { - struct fxsave *fxsave = (struct fxsave *)vcpu->guest_fx_image; + struct fxsave *fxsave = (struct fxsave *)&vcpu->guest_fx_image; vcpu_load(vcpu); @@ -2513,7 +2710,7 @@ static int kvm_vcpu_ioctl_get_fpu(struct static int kvm_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { - struct fxsave *fxsave = (struct fxsave *)vcpu->guest_fx_image; + struct fxsave *fxsave = (struct fxsave *)&vcpu->guest_fx_image; vcpu_load(vcpu); @@ -2531,6 +2728,27 @@ static int kvm_vcpu_ioctl_set_fpu(struct return 0; } +static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, + struct kvm_lapic_state *s) +{ + vcpu_load(vcpu); + memcpy(s->regs, vcpu->apic->regs, sizeof *s); + vcpu_put(vcpu); + + return 0; +} + +static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, + struct kvm_lapic_state *s) +{ + vcpu_load(vcpu); + memcpy(vcpu->apic->regs, s->regs, sizeof *s); + kvm_apic_post_state_restore(vcpu); + vcpu_put(vcpu); + + return 0; +} + static long kvm_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2700,6 +2918,31 @@ static long kvm_vcpu_ioctl(struct file * r = 0; break; } + case KVM_GET_LAPIC: { + struct kvm_lapic_state lapic; + + memset(&lapic, 0, sizeof lapic); + r = kvm_vcpu_ioctl_get_lapic(vcpu, &lapic); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user(argp, &lapic, sizeof lapic)) + goto out; + r = 0; + break; + } + case KVM_SET_LAPIC: { + struct kvm_lapic_state lapic; + + r = -EFAULT; + if (copy_from_user(&lapic, argp, sizeof lapic)) + goto out; + r = kvm_vcpu_ioctl_set_lapic(vcpu, &lapic);; + if (r) + goto out; + r = 0; + break; + } default: ; } @@ -2753,6 +2996,75 @@ static long kvm_vm_ioctl(struct file *fi goto out; break; } + case KVM_CREATE_IRQCHIP: + r = -ENOMEM; + kvm->vpic = kvm_create_pic(kvm); + if (kvm->vpic) { + r = kvm_ioapic_init(kvm); + if (r) { + kfree(kvm->vpic); + kvm->vpic = NULL; + goto out; + } + } + else + goto out; + break; + case KVM_IRQ_LINE: { + struct kvm_irq_level irq_event; + + r = -EFAULT; + if (copy_from_user(&irq_event, argp, sizeof irq_event)) + goto out; + if (irqchip_in_kernel(kvm)) { + mutex_lock(&kvm->lock); + if (irq_event.irq < 16) + kvm_pic_set_irq(pic_irqchip(kvm), + irq_event.irq, + irq_event.level); + kvm_ioapic_set_irq(kvm->vioapic, + irq_event.irq, + irq_event.level); + mutex_unlock(&kvm->lock); + r = 0; + } + break; + } + case KVM_GET_IRQCHIP: { + /* 0: PIC master, 1: PIC 
slave, 2: IOAPIC */ + struct kvm_irqchip chip; + + r = -EFAULT; + if (copy_from_user(&chip, argp, sizeof chip)) + goto out; + r = -ENXIO; + if (!irqchip_in_kernel(kvm)) + goto out; + r = kvm_vm_ioctl_get_irqchip(kvm, &chip); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user(argp, &chip, sizeof chip)) + goto out; + r = 0; + break; + } + case KVM_SET_IRQCHIP: { + /* 0: PIC master, 1: PIC slave, 2: IOAPIC */ + struct kvm_irqchip chip; + + r = -EFAULT; + if (copy_from_user(&chip, argp, sizeof chip)) + goto out; + r = -ENXIO; + if (!irqchip_in_kernel(kvm)) + goto out; + r = kvm_vm_ioctl_set_irqchip(kvm, &chip); + if (r) + goto out; + r = 0; + break; + } default: ; } @@ -2768,12 +3080,14 @@ static struct page *kvm_vm_nopage(struct unsigned long pgoff; struct page *page; - *type = VM_FAULT_MINOR; pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; page = gfn_to_page(kvm, pgoff); if (!page) return NOPAGE_SIGBUS; get_page(page); + if (type != NULL) + *type = VM_FAULT_MINOR; + return page; } @@ -2861,12 +3175,20 @@ static long kvm_dev_ioctl(struct file *f r = 0; break; } - case KVM_CHECK_EXTENSION: - /* - * No extensions defined at present. - */ - r = 0; + case KVM_CHECK_EXTENSION: { + int ext = (long)argp; + + switch (ext) { + case KVM_CAP_IRQCHIP: + case KVM_CAP_HLT: + r = 1; + break; + default: + r = 0; + break; + } break; + } case KVM_GET_VCPU_MMAP_SIZE: r = -EINVAL; if (arg) @@ -2881,8 +3203,6 @@ out: } static struct file_operations kvm_chardev_ops = { - .open = kvm_dev_open, - .release = kvm_dev_release, .unlocked_ioctl = kvm_dev_ioctl, .compat_ioctl = kvm_dev_ioctl, }; @@ -2893,25 +3213,6 @@ static struct miscdevice kvm_dev = { &kvm_chardev_ops, }; -static int kvm_reboot(struct notifier_block *notifier, unsigned long val, - void *v) -{ - if (val == SYS_RESTART) { - /* - * Some (well, at least mine) BIOSes hang on reboot if - * in vmx root mode. - */ - printk(KERN_INFO "kvm: exiting hardware virtualization\n"); - on_each_cpu(hardware_disable, NULL, 0, 1); - } - return NOTIFY_OK; -} - -static struct notifier_block kvm_reboot_notifier = { - .notifier_call = kvm_reboot, - .priority = 0, -}; - /* * Make sure that a cpu that is being hot-unplugged does not have any vcpus * cached on it. 
 */
@@ -2925,7 +3226,9 @@ static void decache_vcpus_on_cpu(int cpu spin_lock(&kvm_lock); list_for_each_entry(vm, &vm_list, vm_list) for (i = 0; i < KVM_MAX_VCPUS; ++i) { - vcpu = &vm->vcpus[i]; + vcpu = vm->vcpus[i]; + if (!vcpu) + continue; /* * If the vcpu is locked, then it is running on some * other cpu and therefore it is not cached on the @@ -2936,7 +3239,7 @@ static void decache_vcpus_on_cpu(int cpu */ if (mutex_trylock(&vcpu->mutex)) { if (vcpu->cpu == cpu) { - kvm_arch_ops->vcpu_decache(vcpu); + kvm_x86_ops->vcpu_decache(vcpu); vcpu->cpu = -1; } mutex_unlock(&vcpu->mutex); @@ -2952,7 +3255,7 @@ static void hardware_enable(void *junk) if (cpu_isset(cpu, cpus_hardware_enabled)) return; cpu_set(cpu, cpus_hardware_enabled); - kvm_arch_ops->hardware_enable(NULL); + kvm_x86_ops->hardware_enable(NULL); } static void hardware_disable(void *junk) @@ -2963,7 +3266,7 @@ static void hardware_disable(void *junk) return; cpu_clear(cpu, cpus_hardware_enabled); decache_vcpus_on_cpu(cpu); - kvm_arch_ops->hardware_disable(NULL); + kvm_x86_ops->hardware_disable(NULL); } static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val, @@ -2994,6 +3297,25 @@ static int kvm_cpu_hotplug(struct notifi return NOTIFY_OK; } +static int kvm_reboot(struct notifier_block *notifier, unsigned long val, + void *v) +{ + if (val == SYS_RESTART) { + /* + * Some (well, at least mine) BIOSes hang on reboot if + * in vmx root mode. + */ + printk(KERN_INFO "kvm: exiting hardware virtualization\n"); + on_each_cpu(hardware_disable, NULL, 0, 1); + } + return NOTIFY_OK; +} + +static struct notifier_block kvm_reboot_notifier = { + .notifier_call = kvm_reboot, + .priority = 0, +}; + void kvm_io_bus_init(struct kvm_io_bus *bus) { memset(bus, 0, sizeof(*bus)); @@ -3047,18 +3369,15 @@ static u64 stat_get(void *_offset) spin_lock(&kvm_lock); list_for_each_entry(kvm, &vm_list, vm_list) for (i = 0; i < KVM_MAX_VCPUS; ++i) { - vcpu = &kvm->vcpus[i]; - total += *(u32 *)((void *)vcpu + offset); + vcpu = kvm->vcpus[i]; + if (vcpu) + total += *(u32 *)((void *)vcpu + offset); } spin_unlock(&kvm_lock); return total; } -static void stat_set(void *offset, u64 val) -{ -} - -DEFINE_SIMPLE_ATTRIBUTE(stat_fops, stat_get, stat_set, "%llu\n"); +DEFINE_SIMPLE_ATTRIBUTE(stat_fops, stat_get, NULL, "%llu\n"); static __init void kvm_init_debug(void) { @@ -3105,11 +3424,34 @@ static struct sys_device kvm_sysdev = { hpa_t bad_page_address; -int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module) +static inline +struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn) +{ + return container_of(pn, struct kvm_vcpu, preempt_notifier); +} + +static void kvm_sched_in(struct preempt_notifier *pn, int cpu) +{ + struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn); + + kvm_x86_ops->vcpu_load(vcpu, cpu); +} + +static void kvm_sched_out(struct preempt_notifier *pn, + struct task_struct *next) +{ + struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn); + + kvm_x86_ops->vcpu_put(vcpu); +} + +int kvm_init_x86(struct kvm_x86_ops *ops, unsigned int vcpu_size, + struct module *module) { int r; + int cpu; - if (kvm_arch_ops) { + if (kvm_x86_ops) { printk(KERN_ERR "kvm: already loaded the other module\n"); return -EEXIST; } @@ -3123,12 +3465,20 @@ int kvm_init_arch(struct kvm_arch_ops *o return -EOPNOTSUPP; } - kvm_arch_ops = ops; + kvm_x86_ops = ops; - r = kvm_arch_ops->hardware_setup(); + r = kvm_x86_ops->hardware_setup(); if (r < 0) goto out; + for_each_online_cpu(cpu) { + smp_call_function_single(cpu, + 
kvm_x86_ops->check_processor_compatibility,
+ &r, 0, 1);
+ if (r < 0)
+ goto out_free_0;
+ }
+
 on_each_cpu(hardware_enable, NULL, 0, 1);
 r = register_cpu_notifier(&kvm_cpu_notifier);
 if (r)
@@ -3143,6 +3493,14 @@ int kvm_init_arch(struct kvm_arch_ops *o
 if (r)
 goto out_free_3;
+ /* A kmem cache lets us meet the alignment requirements of fx_save. */
+ kvm_vcpu_cache = kmem_cache_create("kvm_vcpu", vcpu_size,
+ __alignof__(struct kvm_vcpu), 0, 0);
+ if (!kvm_vcpu_cache) {
+ r = -ENOMEM;
+ goto out_free_4;
+ }
+
 kvm_chardev_ops.owner = module;
 r = misc_register(&kvm_dev);
@@ -3151,9 +3509,16 @@ int kvm_init_arch(struct kvm_arch_ops *o
 goto out_free;
 }
- return r;
+ kvm_preempt_ops.sched_in = kvm_sched_in;
+ kvm_preempt_ops.sched_out = kvm_sched_out;
+
+ kvm_mmu_set_nonpresent_ptes(0ull, 0ull);
+
+ return 0;
out_free:
+ kmem_cache_destroy(kvm_vcpu_cache);
+out_free_4:
 sysdev_unregister(&kvm_sysdev);
out_free_3:
 sysdev_class_unregister(&kvm_sysdev_class);
@@ -3162,22 +3527,24 @@ out_free_2:
 unregister_cpu_notifier(&kvm_cpu_notifier);
out_free_1:
 on_each_cpu(hardware_disable, NULL, 0, 1);
- kvm_arch_ops->hardware_unsetup();
+out_free_0:
+ kvm_x86_ops->hardware_unsetup();
out:
- kvm_arch_ops = NULL;
+ kvm_x86_ops = NULL;
 return r;
}
-void kvm_exit_arch(void)
+void kvm_exit_x86(void)
{
 misc_deregister(&kvm_dev);
+ kmem_cache_destroy(kvm_vcpu_cache);
 sysdev_unregister(&kvm_sysdev);
 sysdev_class_unregister(&kvm_sysdev_class);
 unregister_reboot_notifier(&kvm_reboot_notifier);
 unregister_cpu_notifier(&kvm_cpu_notifier);
 on_each_cpu(hardware_disable, NULL, 0, 1);
- kvm_arch_ops->hardware_unsetup();
- kvm_arch_ops = NULL;
+ kvm_x86_ops->hardware_unsetup();
+ kvm_x86_ops = NULL;
}
static __init int kvm_init(void)
@@ -3220,5 +3587,5 @@ static __exit void kvm_exit(void)
module_init(kvm_init)
module_exit(kvm_exit)
-EXPORT_SYMBOL_GPL(kvm_init_arch);
-EXPORT_SYMBOL_GPL(kvm_exit_arch);
+EXPORT_SYMBOL_GPL(kvm_init_x86);
+EXPORT_SYMBOL_GPL(kvm_exit_x86);
diff --git a/drivers/kvm/kvm_svm.h b/drivers/kvm/kvm_svm.h
index a869983..a0e415d 100644
--- a/drivers/kvm/kvm_svm.h
+++ b/drivers/kvm/kvm_svm.h
@@ -20,7 +20,10 @@ #endif
#define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs)
#define NUM_DB_REGS 4
+struct kvm_vcpu;
+
struct vcpu_svm {
+ struct kvm_vcpu vcpu;
 struct vmcb *vmcb;
 unsigned long vmcb_pa;
 struct svm_cpu_data *svm_data;
diff --git a/drivers/kvm/lapic.c b/drivers/kvm/lapic.c
new file mode 100644
index 0000000..6e0f7e5
--- /dev/null
+++ b/drivers/kvm/lapic.c
@@ -0,0 +1,1058 @@
+
+/*
+ * Local APIC virtualization
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright (C) 2007 Novell
+ * Copyright (C) 2007 Intel
+ *
+ * Authors:
+ * Dor Laor
+ * Gregory Haskins
+ * Yaozu (Eddie) Dong
+ *
+ * Based on Xen 3.1 code, Copyright (c) 2004, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "kvm.h"
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/smp.h>
+#include <linux/hrtimer.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <asm/processor.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/current.h>
+#include <asm/apicdef.h>
+#include <asm/atomic.h>
+#include <asm/div64.h>
+#include "irq.h"
+
+#define PRId64 "d"
+#define PRIx64 "llx"
+#define PRIu64 "u"
+#define PRIo64 "o"
+
+#define APIC_BUS_CYCLE_NS 1
+
+/* #define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg) */
+#define apic_debug(fmt, arg...)
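A quick standalone sketch of the vector-bitmap layout assumed by the VEC_POS()/REG_POS() defines just below (ordinary user-space C, not from the patch): IRR, ISR and TMR each keep one bit per vector, 256 bits stored as eight 32-bit registers, with each register sitting on a 16-byte boundary.

	#include <stdio.h>

	#define VEC_POS(v) ((v) & (32 - 1))	/* bit index inside a 32-bit bank */
	#define REG_POS(v) (((v) >> 5) << 4)	/* byte offset of the bank; banks sit 16 bytes apart */

	int main(void)
	{
		unsigned int v = 0x31;	/* sample vector */

		/* vector 0x31 -> bank at offset 0x10, bit 17 */
		printf("vector 0x%02x -> bank offset 0x%02x, bit %u\n",
		       v, REG_POS(v), VEC_POS(v));
		return 0;
	}

find_highest_vector() below walks the same layout from the top, stepping four u32s (16 bytes) per bank and applying fls() to the first non-zero bank.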
+
+#define APIC_LVT_NUM 6
+/* 14 is the version for Xeon and Pentium; see SDM 8.4.8 */
+#define APIC_VERSION (0x14UL | ((APIC_LVT_NUM - 1) << 16))
+#define LAPIC_MMIO_LENGTH (1 << 12)
+/* the following defines are not in apicdef.h */
+#define APIC_SHORT_MASK 0xc0000
+#define APIC_DEST_NOSHORT 0x0
+#define APIC_DEST_MASK 0x800
+#define MAX_APIC_VECTOR 256
+
+#define VEC_POS(v) ((v) & (32 - 1))
+#define REG_POS(v) (((v) >> 5) << 4)
+static inline u32 apic_get_reg(struct kvm_lapic *apic, int reg_off)
+{
+ return *((u32 *) (apic->regs + reg_off));
+}
+
+static inline void apic_set_reg(struct kvm_lapic *apic, int reg_off, u32 val)
+{
+ *((u32 *) (apic->regs + reg_off)) = val;
+}
+
+static inline int apic_test_and_set_vector(int vec, void *bitmap)
+{
+ return test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline int apic_test_and_clear_vector(int vec, void *bitmap)
+{
+ return test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline void apic_set_vector(int vec, void *bitmap)
+{
+ set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline void apic_clear_vector(int vec, void *bitmap)
+{
+ clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline int apic_hw_enabled(struct kvm_lapic *apic)
+{
+ return (apic)->vcpu->apic_base & MSR_IA32_APICBASE_ENABLE;
+}
+
+static inline int apic_sw_enabled(struct kvm_lapic *apic)
+{
+ return apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_APIC_ENABLED;
+}
+
+static inline int apic_enabled(struct kvm_lapic *apic)
+{
+ return apic_sw_enabled(apic) && apic_hw_enabled(apic);
+}
+
+#define LVT_MASK \
+ (APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
+
+#define LINT_MASK \
+ (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
+ APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
+
+static inline int kvm_apic_id(struct kvm_lapic *apic)
+{
+ return (apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
+}
+
+static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
+{
+ return !(apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
+}
+
+static inline int apic_lvt_vector(struct kvm_lapic *apic, int lvt_type)
+{
+ return apic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK;
+}
+
+static inline int apic_lvtt_period(struct kvm_lapic *apic)
+{
+ return apic_get_reg(apic, APIC_LVTT) & APIC_LVT_TIMER_PERIODIC;
+}
+
+static unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
+ LVT_MASK | APIC_LVT_TIMER_PERIODIC, /* LVTT */
+ LVT_MASK | APIC_MODE_MASK, /* LVTTHMR */
+ LVT_MASK | APIC_MODE_MASK, /* LVTPC */
+ LINT_MASK, LINT_MASK, /* LVT0-1 */
+ LVT_MASK /* LVTERR */
+};
+
+static int find_highest_vector(void *bitmap)
+{
+ u32 *word = bitmap;
+ int word_offset = MAX_APIC_VECTOR >> 5;
+
+ while ((word_offset != 0) && (word[(--word_offset) << 2] == 0))
+ continue;
+
+ if (likely(!word_offset && !word[0]))
+ return -1;
+ else
+ return fls(word[word_offset << 2]) - 1 + (word_offset << 5);
+}
+
+static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
+{
+ return apic_test_and_set_vector(vec, apic->regs + APIC_IRR);
+}
+
+static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
+{
+ apic_clear_vector(vec, apic->regs + APIC_IRR);
+}
+
+static inline int apic_find_highest_irr(struct kvm_lapic *apic)
+{
+ int result;
+
+ result = find_highest_vector(apic->regs + APIC_IRR);
+ ASSERT(result == -1 || result >= 16);
+
+ return result;
+}
+
+int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = (struct kvm_lapic *)vcpu->apic;
+ int highest_irr;
+
+ if (!apic)
+ return 0;
+ highest_irr =
apic_find_highest_irr(apic); + + return highest_irr; +} +EXPORT_SYMBOL_GPL(kvm_lapic_find_highest_irr); + +int kvm_apic_set_irq(struct kvm_lapic *apic, u8 vec, u8 trig) +{ + if (!apic_test_and_set_irr(vec, apic)) { + /* a new pending irq is set in IRR */ + if (trig) + apic_set_vector(vec, apic->regs + APIC_TMR); + else + apic_clear_vector(vec, apic->regs + APIC_TMR); + kvm_vcpu_kick(apic->vcpu); + return 1; + } + return 0; +} + +static inline int apic_find_highest_isr(struct kvm_lapic *apic) +{ + int result; + + result = find_highest_vector(apic->regs + APIC_ISR); + ASSERT(result == -1 || result >= 16); + + return result; +} + +static void apic_update_ppr(struct kvm_lapic *apic) +{ + u32 tpr, isrv, ppr; + int isr; + + tpr = apic_get_reg(apic, APIC_TASKPRI); + isr = apic_find_highest_isr(apic); + isrv = (isr != -1) ? isr : 0; + + if ((tpr & 0xf0) >= (isrv & 0xf0)) + ppr = tpr & 0xff; + else + ppr = isrv & 0xf0; + + apic_debug("vlapic %p, ppr 0x%x, isr 0x%x, isrv 0x%x", + apic, ppr, isr, isrv); + + apic_set_reg(apic, APIC_PROCPRI, ppr); +} + +static void apic_set_tpr(struct kvm_lapic *apic, u32 tpr) +{ + apic_set_reg(apic, APIC_TASKPRI, tpr); + apic_update_ppr(apic); +} + +int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest) +{ + return kvm_apic_id(apic) == dest; +} + +int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda) +{ + int result = 0; + u8 logical_id; + + logical_id = GET_APIC_LOGICAL_ID(apic_get_reg(apic, APIC_LDR)); + + switch (apic_get_reg(apic, APIC_DFR)) { + case APIC_DFR_FLAT: + if (logical_id & mda) + result = 1; + break; + case APIC_DFR_CLUSTER: + if (((logical_id >> 4) == (mda >> 0x4)) + && (logical_id & mda & 0xf)) + result = 1; + break; + default: + printk(KERN_WARNING "Bad DFR vcpu %d: %08x\n", + apic->vcpu->vcpu_id, apic_get_reg(apic, APIC_DFR)); + break; + } + + return result; +} + +static int apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source, + int short_hand, int dest, int dest_mode) +{ + int result = 0; + struct kvm_lapic *target = vcpu->apic; + + apic_debug("target %p, source %p, dest 0x%x, " + "dest_mode 0x%x, short_hand 0x%x", + target, source, dest, dest_mode, short_hand); + + ASSERT(!target); + switch (short_hand) { + case APIC_DEST_NOSHORT: + if (dest_mode == 0) { + /* Physical mode. */ + if ((dest == 0xFF) || (dest == kvm_apic_id(target))) + result = 1; + } else + /* Logical mode. */ + result = kvm_apic_match_logical_addr(target, dest); + break; + case APIC_DEST_SELF: + if (target == source) + result = 1; + break; + case APIC_DEST_ALLINC: + result = 1; + break; + case APIC_DEST_ALLBUT: + if (target != source) + result = 1; + break; + default: + printk(KERN_WARNING "Bad dest shorthand value %x\n", + short_hand); + break; + } + + return result; +} + +/* + * Add a pending IRQ into lapic. + * Return 1 if successfully added and 0 if discarded. 
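+ * A vector still set in IRR when it is raised again is coalesced and
+ * counted as discarded. The TMR bit is kept in sync with the trigger
+ * mode so apic_set_eoi() knows whether to propagate the EOI to the
+ * ioapic.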
+ */
+static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
+ int vector, int level, int trig_mode)
+{
+ int orig_irr, result = 0;
+ struct kvm_vcpu *vcpu = apic->vcpu;
+
+ switch (delivery_mode) {
+ case APIC_DM_FIXED:
+ case APIC_DM_LOWEST:
+ /* FIXME add logic for vcpu on reset */
+ if (unlikely(!apic_enabled(apic)))
+ break;
+
+ orig_irr = apic_test_and_set_irr(vector, apic);
+ if (orig_irr && trig_mode) {
+ apic_debug("level trig mode repeatedly for vector %d",
+ vector);
+ break;
+ }
+
+ if (trig_mode) {
+ apic_debug("level trig mode for vector %d", vector);
+ apic_set_vector(vector, apic->regs + APIC_TMR);
+ } else
+ apic_clear_vector(vector, apic->regs + APIC_TMR);
+
+ if (vcpu->mp_state == VCPU_MP_STATE_RUNNABLE)
+ kvm_vcpu_kick(vcpu);
+ else if (vcpu->mp_state == VCPU_MP_STATE_HALTED) {
+ vcpu->mp_state = VCPU_MP_STATE_RUNNABLE;
+ if (waitqueue_active(&vcpu->wq))
+ wake_up_interruptible(&vcpu->wq);
+ }
+
+ result = (orig_irr == 0);
+ break;
+
+ case APIC_DM_REMRD:
+ printk(KERN_DEBUG "Ignoring delivery mode 3\n");
+ break;
+
+ case APIC_DM_SMI:
+ printk(KERN_DEBUG "Ignoring guest SMI\n");
+ break;
+ case APIC_DM_NMI:
+ printk(KERN_DEBUG "Ignoring guest NMI\n");
+ break;
+
+ case APIC_DM_INIT:
+ if (level) {
+ if (vcpu->mp_state == VCPU_MP_STATE_RUNNABLE)
+ printk(KERN_DEBUG
+ "INIT on a runnable vcpu %d\n",
+ vcpu->vcpu_id);
+ vcpu->mp_state = VCPU_MP_STATE_INIT_RECEIVED;
+ kvm_vcpu_kick(vcpu);
+ } else {
+ printk(KERN_DEBUG
+ "Ignoring de-assert INIT to vcpu %d\n",
+ vcpu->vcpu_id);
+ }
+
+ break;
+
+ case APIC_DM_STARTUP:
+ printk(KERN_DEBUG "SIPI to vcpu %d vector 0x%02x\n",
+ vcpu->vcpu_id, vector);
+ if (vcpu->mp_state == VCPU_MP_STATE_INIT_RECEIVED) {
+ vcpu->sipi_vector = vector;
+ vcpu->mp_state = VCPU_MP_STATE_SIPI_RECEIVED;
+ if (waitqueue_active(&vcpu->wq))
+ wake_up_interruptible(&vcpu->wq);
+ }
+ break;
+
+ default:
+ printk(KERN_ERR "TODO: unsupported delivery mode %x\n",
+ delivery_mode);
+ break;
+ }
+ return result;
+}
+
+struct kvm_lapic *kvm_apic_round_robin(struct kvm *kvm, u8 vector,
+ unsigned long bitmap)
+{
+ int last;
+ int next;
+ struct kvm_lapic *apic = NULL;
+
+ last = kvm->round_robin_prev_vcpu;
+ next = last;
+
+ do {
+ if (++next == KVM_MAX_VCPUS)
+ next = 0;
+ if (kvm->vcpus[next] == NULL || !test_bit(next, &bitmap))
+ continue;
+ apic = kvm->vcpus[next]->apic;
+ if (apic && apic_enabled(apic))
+ break;
+ apic = NULL;
+ } while (next != last);
+ kvm->round_robin_prev_vcpu = next;
+
+ if (!apic)
+ printk(KERN_DEBUG "vcpu not ready for apic_round_robin\n");
+
+ return apic;
+}
+
+static void apic_set_eoi(struct kvm_lapic *apic)
+{
+ int vector = apic_find_highest_isr(apic);
+
+ /*
+ * Not every EOI write has a corresponding ISR vector set;
+ * one example is when the kernel checks the timer on setup_IO_APIC.
+ */
+ if (vector == -1)
+ return;
+
+ apic_clear_vector(vector, apic->regs + APIC_ISR);
+ apic_update_ppr(apic);
+
+ if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
+ kvm_ioapic_update_eoi(apic->vcpu->kvm, vector);
+}
+
+static void apic_send_ipi(struct kvm_lapic *apic)
+{
+ u32 icr_low = apic_get_reg(apic, APIC_ICR);
+ u32 icr_high = apic_get_reg(apic, APIC_ICR2);
+
+ unsigned int dest = GET_APIC_DEST_FIELD(icr_high);
+ unsigned int short_hand = icr_low & APIC_SHORT_MASK;
+ unsigned int trig_mode = icr_low & APIC_INT_LEVELTRIG;
+ unsigned int level = icr_low & APIC_INT_ASSERT;
+ unsigned int dest_mode = icr_low & APIC_DEST_MASK;
+ unsigned int delivery_mode = icr_low & APIC_MODE_MASK;
+ unsigned int vector = icr_low &
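 /* decoded per the xAPIC ICR layout: vector 7:0, delivery mode 10:8,
 destination mode 11, level 14, trigger mode 15, shorthand 19:18;
 the destination id is ICR2 bits 31:24 */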
APIC_VECTOR_MASK; + + struct kvm_lapic *target; + struct kvm_vcpu *vcpu; + unsigned long lpr_map = 0; + int i; + + apic_debug("icr_high 0x%x, icr_low 0x%x, " + "short_hand 0x%x, dest 0x%x, trig_mode 0x%x, level 0x%x, " + "dest_mode 0x%x, delivery_mode 0x%x, vector 0x%x\n", + icr_high, icr_low, short_hand, dest, + trig_mode, level, dest_mode, delivery_mode, vector); + + for (i = 0; i < KVM_MAX_VCPUS; i++) { + vcpu = apic->vcpu->kvm->vcpus[i]; + if (!vcpu) + continue; + + if (vcpu->apic && + apic_match_dest(vcpu, apic, short_hand, dest, dest_mode)) { + if (delivery_mode == APIC_DM_LOWEST) + set_bit(vcpu->vcpu_id, &lpr_map); + else + __apic_accept_irq(vcpu->apic, delivery_mode, + vector, level, trig_mode); + } + } + + if (delivery_mode == APIC_DM_LOWEST) { + target = kvm_apic_round_robin(vcpu->kvm, vector, lpr_map); + if (target != NULL) + __apic_accept_irq(target, delivery_mode, + vector, level, trig_mode); + } +} + +static u32 apic_get_tmcct(struct kvm_lapic *apic) +{ + u32 counter_passed; + ktime_t passed, now = apic->timer.dev.base->get_time(); + u32 tmcct = apic_get_reg(apic, APIC_TMICT); + + ASSERT(apic != NULL); + + if (unlikely(ktime_to_ns(now) <= + ktime_to_ns(apic->timer.last_update))) { + /* Wrap around */ + passed = ktime_add(( { + (ktime_t) { + .tv64 = KTIME_MAX - + (apic->timer.last_update).tv64}; } + ), now); + apic_debug("time elapsed\n"); + } else + passed = ktime_sub(now, apic->timer.last_update); + + counter_passed = div64_64(ktime_to_ns(passed), + (APIC_BUS_CYCLE_NS * apic->timer.divide_count)); + tmcct -= counter_passed; + + if (tmcct <= 0) { + if (unlikely(!apic_lvtt_period(apic))) + tmcct = 0; + else + do { + tmcct += apic_get_reg(apic, APIC_TMICT); + } while (tmcct <= 0); + } + + return tmcct; +} + +static u32 __apic_read(struct kvm_lapic *apic, unsigned int offset) +{ + u32 val = 0; + + if (offset >= LAPIC_MMIO_LENGTH) + return 0; + + switch (offset) { + case APIC_ARBPRI: + printk(KERN_WARNING "Access APIC ARBPRI register " + "which is for P6\n"); + break; + + case APIC_TMCCT: /* Timer CCR */ + val = apic_get_tmcct(apic); + break; + + default: + apic_update_ppr(apic); + val = apic_get_reg(apic, offset); + break; + } + + return val; +} + +static void apic_mmio_read(struct kvm_io_device *this, + gpa_t address, int len, void *data) +{ + struct kvm_lapic *apic = (struct kvm_lapic *)this->private; + unsigned int offset = address - apic->base_address; + unsigned char alignment = offset & 0xf; + u32 result; + + if ((alignment + len) > 4) { + printk(KERN_ERR "KVM_APIC_READ: alignment error %lx %d", + (unsigned long)address, len); + return; + } + result = __apic_read(apic, offset & ~0xf); + + switch (len) { + case 1: + case 2: + case 4: + memcpy(data, (char *)&result + alignment, len); + break; + default: + printk(KERN_ERR "Local APIC read with len = %x, " + "should be 1,2, or 4 instead\n", len); + break; + } +} + +static void update_divide_count(struct kvm_lapic *apic) +{ + u32 tmp1, tmp2, tdcr; + + tdcr = apic_get_reg(apic, APIC_TDCR); + tmp1 = tdcr & 0xf; + tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1; + apic->timer.divide_count = 0x1 << (tmp2 & 0x7); + + apic_debug("timer divide count is 0x%x\n", + apic->timer.divide_count); +} + +static void start_apic_timer(struct kvm_lapic *apic) +{ + ktime_t now = apic->timer.dev.base->get_time(); + + apic->timer.last_update = now; + + apic->timer.period = apic_get_reg(apic, APIC_TMICT) * + APIC_BUS_CYCLE_NS * apic->timer.divide_count; + atomic_set(&apic->timer.pending, 0); + hrtimer_start(&apic->timer.dev, + ktime_add_ns(now, 
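 /* absolute expiry = now + period; period (ns) was computed above as
 TMICT * APIC_BUS_CYCLE_NS * the TDCR-derived divide_count */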
apic->timer.period),
+ HRTIMER_MODE_ABS);
+
+ apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016"
+ PRIx64 ", "
+ "timer initial count 0x%x, period %lldns, "
+ "expire @ 0x%016" PRIx64 ".\n", __FUNCTION__,
+ APIC_BUS_CYCLE_NS, ktime_to_ns(now),
+ apic_get_reg(apic, APIC_TMICT),
+ apic->timer.period,
+ ktime_to_ns(ktime_add_ns(now,
+ apic->timer.period)));
+}
+
+static void apic_mmio_write(struct kvm_io_device *this,
+ gpa_t address, int len, const void *data)
+{
+ struct kvm_lapic *apic = (struct kvm_lapic *)this->private;
+ unsigned int offset = address - apic->base_address;
+ unsigned char alignment = offset & 0xf;
+ u32 val;
+
+ /*
+ * APIC registers must be aligned on a 128-bit boundary.
+ * 32/64/128-bit registers must be accessed through 32-bit loads
+ * and stores. See SDM 8.4.1.
+ */
+ if (len != 4 || alignment) {
+ if (printk_ratelimit())
+ printk(KERN_ERR "apic write: bad size=%d %lx\n",
+ len, (long)address);
+ return;
+ }
+
+ val = *(u32 *) data;
+
+ /* EOI writes are too frequent to be worth logging */
+ if (offset != APIC_EOI)
+ apic_debug("%s: offset 0x%x with length 0x%x, and value is "
+ "0x%x\n", __FUNCTION__, offset, len, val);
+
+ offset &= 0xff0;
+
+ switch (offset) {
+ case APIC_ID: /* Local APIC ID */
+ apic_set_reg(apic, APIC_ID, val);
+ break;
+
+ case APIC_TASKPRI:
+ apic_set_tpr(apic, val & 0xff);
+ break;
+
+ case APIC_EOI:
+ apic_set_eoi(apic);
+ break;
+
+ case APIC_LDR:
+ apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
+ break;
+
+ case APIC_DFR:
+ apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
+ break;
+
+ case APIC_SPIV:
+ apic_set_reg(apic, APIC_SPIV, val & 0x3ff);
+ if (!(val & APIC_SPIV_APIC_ENABLED)) {
+ int i;
+ u32 lvt_val;
+
+ for (i = 0; i < APIC_LVT_NUM; i++) {
+ lvt_val = apic_get_reg(apic,
+ APIC_LVTT + 0x10 * i);
+ apic_set_reg(apic, APIC_LVTT + 0x10 * i,
+ lvt_val | APIC_LVT_MASKED);
+ }
+ atomic_set(&apic->timer.pending, 0);
+
+ }
+ break;
+
+ case APIC_ICR:
+ /* No delay here, so we always clear the pending bit */
+ apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
+ apic_send_ipi(apic);
+ break;
+
+ case APIC_ICR2:
+ apic_set_reg(apic, APIC_ICR2, val & 0xff000000);
+ break;
+
+ case APIC_LVTT:
+ case APIC_LVTTHMR:
+ case APIC_LVTPC:
+ case APIC_LVT0:
+ case APIC_LVT1:
+ case APIC_LVTERR:
+ /* TODO: Check vector */
+ if (!apic_sw_enabled(apic))
+ val |= APIC_LVT_MASKED;
+
+ val &= apic_lvt_mask[(offset - APIC_LVTT) >> 4];
+ apic_set_reg(apic, offset, val);
+
+ break;
+
+ case APIC_TMICT:
+ hrtimer_cancel(&apic->timer.dev);
+ apic_set_reg(apic, APIC_TMICT, val);
+ start_apic_timer(apic);
+ return;
+
+ case APIC_TDCR:
+ if (val & 4)
+ printk(KERN_ERR "KVM_WRITE:TDCR %x\n", val);
+ apic_set_reg(apic, APIC_TDCR, val);
+ update_divide_count(apic);
+ break;
+
+ default:
+ apic_debug("Local APIC Write to read-only register %x\n",
+ offset);
+ break;
+ }
+
+}
+
+static int apic_mmio_range(struct kvm_io_device *this, gpa_t addr)
+{
+ struct kvm_lapic *apic = (struct kvm_lapic *)this->private;
+ int ret = 0;
+
+
+ if (apic_hw_enabled(apic) &&
+ (addr >= apic->base_address) &&
+ (addr < (apic->base_address + LAPIC_MMIO_LENGTH)))
+ ret = 1;
+
+ return ret;
+}
+
+void kvm_free_apic(struct kvm_lapic *apic)
+{
+ if (!apic)
+ return;
+
+ hrtimer_cancel(&apic->timer.dev);
+
+ if (apic->regs_page) {
+ __free_page(apic->regs_page);
+ apic->regs_page = 0;
+ }
+
+ kfree(apic);
+}
+
+/*
+ *------------------------------------------------------------
+ * LAPIC interface
+ *------------------------------------------------------------
+ */
+
+void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu,
unsigned long cr8) +{ + struct kvm_lapic *apic = (struct kvm_lapic *)vcpu->apic; + + if (!apic) + return; + apic_set_tpr(apic, ((cr8 & 0x0f) << 4)); +} + +u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = (struct kvm_lapic *)vcpu->apic; + u64 tpr; + + if (!apic) + return 0; + tpr = (u64) apic_get_reg(apic, APIC_TASKPRI); + + return (tpr & 0xf0) >> 4; +} +EXPORT_SYMBOL_GPL(kvm_lapic_get_cr8); + +void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) +{ + struct kvm_lapic *apic = (struct kvm_lapic *)vcpu->apic; + + if (!apic) { + value |= MSR_IA32_APICBASE_BSP; + vcpu->apic_base = value; + return; + } + if (apic->vcpu->vcpu_id) + value &= ~MSR_IA32_APICBASE_BSP; + + vcpu->apic_base = value; + apic->base_address = apic->vcpu->apic_base & + MSR_IA32_APICBASE_BASE; + + /* with FSB delivery interrupt, we can restart APIC functionality */ + apic_debug("apic base msr is 0x%016" PRIx64 ", and base address is " + "0x%lx.\n", apic->apic_base, apic->base_address); + +} + +u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu) +{ + return vcpu->apic_base; +} +EXPORT_SYMBOL_GPL(kvm_lapic_get_base); + +void kvm_lapic_reset(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic; + int i; + + apic_debug("%s\n", __FUNCTION__); + + ASSERT(vcpu); + apic = vcpu->apic; + ASSERT(apic != NULL); + + /* Stop the timer in case it's a reset to an active apic */ + hrtimer_cancel(&apic->timer.dev); + + apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24); + apic_set_reg(apic, APIC_LVR, APIC_VERSION); + + for (i = 0; i < APIC_LVT_NUM; i++) + apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED); + apic_set_reg(apic, APIC_LVT0, + SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); + + apic_set_reg(apic, APIC_DFR, 0xffffffffU); + apic_set_reg(apic, APIC_SPIV, 0xff); + apic_set_reg(apic, APIC_TASKPRI, 0); + apic_set_reg(apic, APIC_LDR, 0); + apic_set_reg(apic, APIC_ESR, 0); + apic_set_reg(apic, APIC_ICR, 0); + apic_set_reg(apic, APIC_ICR2, 0); + apic_set_reg(apic, APIC_TDCR, 0); + apic_set_reg(apic, APIC_TMICT, 0); + for (i = 0; i < 8; i++) { + apic_set_reg(apic, APIC_IRR + 0x10 * i, 0); + apic_set_reg(apic, APIC_ISR + 0x10 * i, 0); + apic_set_reg(apic, APIC_TMR + 0x10 * i, 0); + } + apic->timer.divide_count = 0; + atomic_set(&apic->timer.pending, 0); + if (vcpu->vcpu_id == 0) + vcpu->apic_base |= MSR_IA32_APICBASE_BSP; + apic_update_ppr(apic); + + apic_debug(KERN_INFO "%s: vcpu=%p, id=%d, base_msr=" + "0x%016" PRIx64 ", base_address=0x%0lx.\n", __FUNCTION__, + vcpu, kvm_apic_id(apic), + vcpu->apic_base, apic->base_address); +} +EXPORT_SYMBOL_GPL(kvm_lapic_reset); + +int kvm_lapic_enabled(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = (struct kvm_lapic *)vcpu->apic; + int ret = 0; + + if (!apic) + return 0; + ret = apic_enabled(apic); + + return ret; +} +EXPORT_SYMBOL_GPL(kvm_lapic_enabled); + +/* + *---------------------------------------------------------------------- + * timer interface + *---------------------------------------------------------------------- + */ + +/* TODO: make sure __apic_timer_fn runs in current pCPU */ +static int __apic_timer_fn(struct kvm_lapic *apic) +{ + int result = 0; + wait_queue_head_t *q = &apic->vcpu->wq; + + atomic_inc(&apic->timer.pending); + if (waitqueue_active(q)) + { + apic->vcpu->mp_state = VCPU_MP_STATE_RUNNABLE; + wake_up_interruptible(q); + } + if (apic_lvtt_period(apic)) { + result = 1; + apic->timer.dev.expires = ktime_add_ns( + apic->timer.dev.expires, + apic->timer.period); + } + return result; +} + +static int __inject_apic_timer_irq(struct kvm_lapic *apic) 
+{ + int vector; + + vector = apic_lvt_vector(apic, APIC_LVTT); + return __apic_accept_irq(apic, APIC_DM_FIXED, vector, 1, 0); +} + +static enum hrtimer_restart apic_timer_fn(struct hrtimer *data) +{ + struct kvm_lapic *apic; + int restart_timer = 0; + + apic = container_of(data, struct kvm_lapic, timer.dev); + + restart_timer = __apic_timer_fn(apic); + + if (restart_timer) + return HRTIMER_RESTART; + else + return HRTIMER_NORESTART; +} + +int kvm_create_lapic(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic; + + ASSERT(vcpu != NULL); + apic_debug("apic_init %d\n", vcpu->vcpu_id); + + apic = kzalloc(sizeof(*apic), GFP_KERNEL); + if (!apic) + goto nomem; + + vcpu->apic = apic; + + apic->regs_page = alloc_page(GFP_KERNEL); + if (apic->regs_page == NULL) { + printk(KERN_ERR "malloc apic regs error for vcpu %x\n", + vcpu->vcpu_id); + goto nomem; + } + apic->regs = page_address(apic->regs_page); + memset(apic->regs, 0, PAGE_SIZE); + apic->vcpu = vcpu; + + hrtimer_init(&apic->timer.dev, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); + apic->timer.dev.function = apic_timer_fn; + apic->base_address = APIC_DEFAULT_PHYS_BASE; + vcpu->apic_base = APIC_DEFAULT_PHYS_BASE; + + kvm_lapic_reset(vcpu); + apic->dev.read = apic_mmio_read; + apic->dev.write = apic_mmio_write; + apic->dev.in_range = apic_mmio_range; + apic->dev.private = apic; + + return 0; +nomem: + kvm_free_apic(apic); + return -ENOMEM; +} +EXPORT_SYMBOL_GPL(kvm_create_lapic); + +int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->apic; + int highest_irr; + + if (!apic || !apic_enabled(apic)) + return -1; + + apic_update_ppr(apic); + highest_irr = apic_find_highest_irr(apic); + if ((highest_irr == -1) || + ((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI))) + return -1; + return highest_irr; +} + +int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu) +{ + u32 lvt0 = apic_get_reg(vcpu->apic, APIC_LVT0); + int r = 0; + + if (vcpu->vcpu_id == 0) { + if (!apic_hw_enabled(vcpu->apic)) + r = 1; + if ((lvt0 & APIC_LVT_MASKED) == 0 && + GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT) + r = 1; + } + return r; +} + +void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->apic; + + if (apic && apic_lvt_enabled(apic, APIC_LVTT) && + atomic_read(&apic->timer.pending) > 0) { + if (__inject_apic_timer_irq(apic)) + atomic_dec(&apic->timer.pending); + } +} + +void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) +{ + struct kvm_lapic *apic = vcpu->apic; + + if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) + apic->timer.last_update = ktime_add_ns( + apic->timer.last_update, + apic->timer.period); +} + +int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu) +{ + int vector = kvm_apic_has_interrupt(vcpu); + struct kvm_lapic *apic = vcpu->apic; + + if (vector == -1) + return -1; + + apic_set_vector(vector, apic->regs + APIC_ISR); + apic_update_ppr(apic); + apic_clear_irr(vector, apic); + return vector; +} + +void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->apic; + + apic->base_address = vcpu->apic_base & + MSR_IA32_APICBASE_BASE; + apic_set_reg(apic, APIC_LVR, APIC_VERSION); + apic_update_ppr(apic); + hrtimer_cancel(&apic->timer.dev); + update_divide_count(apic); + start_apic_timer(apic); +} + +void kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->apic; + struct hrtimer *timer; + + if (!apic) + return; + + timer = &apic->timer.dev; + if (hrtimer_cancel(timer)) + hrtimer_start(timer, timer->expires, 
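 /* requeue on the current cpu, keeping the old absolute expiry */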
HRTIMER_MODE_ABS); +} +EXPORT_SYMBOL_GPL(kvm_migrate_apic_timer); + diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c index 23965aa..17f5735 100644 --- a/drivers/kvm/mmu.c +++ b/drivers/kvm/mmu.c @@ -156,9 +156,19 @@ static struct kmem_cache *pte_chain_cach static struct kmem_cache *rmap_desc_cache; static struct kmem_cache *mmu_page_header_cache; +static u64 __read_mostly shadow_trap_nonpresent_pte; +static u64 __read_mostly shadow_notrap_nonpresent_pte; + +void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) +{ + shadow_trap_nonpresent_pte = trap_pte; + shadow_notrap_nonpresent_pte = notrap_pte; +} +EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes); + static int is_write_protection(struct kvm_vcpu *vcpu) { - return vcpu->cr0 & CR0_WP_MASK; + return vcpu->cr0 & X86_CR0_WP; } static int is_cpuid_PSE36(void) @@ -176,6 +186,13 @@ static int is_present_pte(unsigned long return pte & PT_PRESENT_MASK; } +static int is_shadow_present_pte(u64 pte) +{ + pte &= ~PT_SHADOW_IO_MARK; + return pte != shadow_trap_nonpresent_pte + && pte != shadow_notrap_nonpresent_pte; +} + static int is_writeble_pte(unsigned long pte) { return pte & PT_WRITABLE_MASK; @@ -202,15 +219,14 @@ #endif } static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, - struct kmem_cache *base_cache, int min, - gfp_t gfp_flags) + struct kmem_cache *base_cache, int min) { void *obj; if (cache->nobjs >= min) return 0; while (cache->nobjs < ARRAY_SIZE(cache->objects)) { - obj = kmem_cache_zalloc(base_cache, gfp_flags); + obj = kmem_cache_zalloc(base_cache, GFP_KERNEL); if (!obj) return -ENOMEM; cache->objects[cache->nobjs++] = obj; @@ -225,14 +241,14 @@ static void mmu_free_memory_cache(struct } static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache, - int min, gfp_t gfp_flags) + int min) { struct page *page; if (cache->nobjs >= min) return 0; while (cache->nobjs < ARRAY_SIZE(cache->objects)) { - page = alloc_page(gfp_flags); + page = alloc_page(GFP_KERNEL); if (!page) return -ENOMEM; set_page_private(page, 0); @@ -247,44 +263,28 @@ static void mmu_free_memory_cache_page(s free_page((unsigned long)mc->objects[--mc->nobjs]); } -static int __mmu_topup_memory_caches(struct kvm_vcpu *vcpu, gfp_t gfp_flags) +static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu) { int r; + kvm_mmu_free_some_pages(vcpu); r = mmu_topup_memory_cache(&vcpu->mmu_pte_chain_cache, - pte_chain_cache, 4, gfp_flags); + pte_chain_cache, 4); if (r) goto out; r = mmu_topup_memory_cache(&vcpu->mmu_rmap_desc_cache, - rmap_desc_cache, 1, gfp_flags); + rmap_desc_cache, 1); if (r) goto out; - r = mmu_topup_memory_cache_page(&vcpu->mmu_page_cache, 4, gfp_flags); + r = mmu_topup_memory_cache_page(&vcpu->mmu_page_cache, 4); if (r) goto out; r = mmu_topup_memory_cache(&vcpu->mmu_page_header_cache, - mmu_page_header_cache, 4, gfp_flags); + mmu_page_header_cache, 4); out: return r; } -static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu) -{ - int r; - - r = __mmu_topup_memory_caches(vcpu, GFP_NOWAIT); - kvm_mmu_free_some_pages(vcpu); - if (r < 0) { - spin_unlock(&vcpu->kvm->lock); - kvm_arch_ops->vcpu_put(vcpu); - r = __mmu_topup_memory_caches(vcpu, GFP_KERNEL); - kvm_arch_ops->vcpu_load(vcpu); - spin_lock(&vcpu->kvm->lock); - kvm_mmu_free_some_pages(vcpu); - } - return r; -} - static void mmu_free_memory_caches(struct kvm_vcpu *vcpu) { mmu_free_memory_cache(&vcpu->mmu_pte_chain_cache); @@ -467,7 +467,7 @@ static int is_empty_shadow_page(u64 *spt u64 *end; for (pos = spt, end = pos + PAGE_SIZE / sizeof(u64); pos != end; pos++) - if 
(*pos != 0) { + if ((*pos & ~PT_SHADOW_IO_MARK) != shadow_trap_nonpresent_pte) { printk(KERN_ERR "%s: %p %llx\n", __FUNCTION__, pos, *pos); return 0; @@ -649,6 +649,7 @@ static struct kvm_mmu_page *kvm_mmu_get_ page->gfn = gfn; page->role = role; hlist_add_head(&page->hash_link, bucket); + vcpu->mmu.prefetch_page(vcpu, page); if (!metaphysical) rmap_write_protect(vcpu, gfn); return page; @@ -665,9 +666,9 @@ static void kvm_mmu_page_unlink_children if (page->role.level == PT_PAGE_TABLE_LEVEL) { for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { - if (pt[i] & PT_PRESENT_MASK) + if (is_shadow_present_pte(pt[i])) rmap_remove(&pt[i]); - pt[i] = 0; + pt[i] = shadow_trap_nonpresent_pte; } kvm_flush_remote_tlbs(kvm); return; @@ -676,8 +677,8 @@ static void kvm_mmu_page_unlink_children for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { ent = pt[i]; - pt[i] = 0; - if (!(ent & PT_PRESENT_MASK)) + pt[i] = shadow_trap_nonpresent_pte; + if (!is_shadow_present_pte(ent)) continue; ent &= PT64_BASE_ADDR_MASK; mmu_page_remove_parent_pte(page_header(ent), &pt[i]); @@ -691,6 +692,15 @@ static void kvm_mmu_put_page(struct kvm_ mmu_page_remove_parent_pte(page, parent_pte); } +static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm) +{ + int i; + + for (i = 0; i < KVM_MAX_VCPUS; ++i) + if (kvm->vcpus[i]) + kvm->vcpus[i]->last_pte_updated = NULL; +} + static void kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *page) { @@ -708,7 +718,7 @@ static void kvm_mmu_zap_page(struct kvm } BUG_ON(!parent_pte); kvm_mmu_put_page(page, parent_pte); - set_shadow_pte(parent_pte, 0); + set_shadow_pte(parent_pte, shadow_trap_nonpresent_pte); } kvm_mmu_page_unlink_children(kvm, page); if (!page->root_count) { @@ -716,6 +726,7 @@ static void kvm_mmu_zap_page(struct kvm kvm_mmu_free_page(kvm, page); } else list_move(&page->link, &kvm->active_mmu_pages); + kvm_mmu_reset_last_pte_updated(kvm); } static int kvm_mmu_unprotect_page(struct kvm_vcpu *vcpu, gfn_t gfn) @@ -815,7 +826,7 @@ static int nonpaging_map(struct kvm_vcpu if (level == 1) { pte = table[index]; - if (is_present_pte(pte) && is_writeble_pte(pte)) + if (is_shadow_present_pte(pte) && is_writeble_pte(pte)) return 0; mark_page_dirty(vcpu->kvm, v >> PAGE_SHIFT); page_header_update_slot(vcpu->kvm, table, v); @@ -825,7 +836,7 @@ static int nonpaging_map(struct kvm_vcpu return 0; } - if (table[index] == 0) { + if (table[index] == shadow_trap_nonpresent_pte) { struct kvm_mmu_page *new_table; gfn_t pseudo_gfn; @@ -846,6 +857,15 @@ static int nonpaging_map(struct kvm_vcpu } } +static void nonpaging_prefetch_page(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp) +{ + int i; + + for (i = 0; i < PT64_ENT_PER_PAGE; ++i) + sp->spt[i] = shadow_trap_nonpresent_pte; +} + static void mmu_free_roots(struct kvm_vcpu *vcpu) { int i; @@ -960,6 +980,7 @@ static int nonpaging_init_context(struct context->page_fault = nonpaging_page_fault; context->gva_to_gpa = nonpaging_gva_to_gpa; context->free = nonpaging_free; + context->prefetch_page = nonpaging_prefetch_page; context->root_level = 0; context->shadow_root_level = PT32E_ROOT_LEVEL; context->root_hpa = INVALID_PAGE; @@ -969,7 +990,7 @@ static int nonpaging_init_context(struct static void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu) { ++vcpu->stat.tlb_flush; - kvm_arch_ops->tlb_flush(vcpu); + kvm_x86_ops->tlb_flush(vcpu); } static void paging_new_cr3(struct kvm_vcpu *vcpu) @@ -982,7 +1003,7 @@ static void inject_page_fault(struct kvm u64 addr, u32 err_code) { - kvm_arch_ops->inject_page_fault(vcpu, addr, err_code); + kvm_x86_ops->inject_page_fault(vcpu, addr, 
err_code); } static void paging_free(struct kvm_vcpu *vcpu) @@ -1006,6 +1027,7 @@ static int paging64_init_context_common( context->new_cr3 = paging_new_cr3; context->page_fault = paging64_page_fault; context->gva_to_gpa = paging64_gva_to_gpa; + context->prefetch_page = paging64_prefetch_page; context->free = paging_free; context->root_level = level; context->shadow_root_level = level; @@ -1026,6 +1048,7 @@ static int paging32_init_context(struct context->page_fault = paging32_page_fault; context->gva_to_gpa = paging32_gva_to_gpa; context->free = paging_free; + context->prefetch_page = paging32_prefetch_page; context->root_level = PT32_ROOT_LEVEL; context->shadow_root_level = PT32E_ROOT_LEVEL; context->root_hpa = INVALID_PAGE; @@ -1071,15 +1094,15 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) { int r; - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); r = mmu_topup_memory_caches(vcpu); if (r) goto out; mmu_alloc_roots(vcpu); - kvm_arch_ops->set_cr3(vcpu, vcpu->mmu.root_hpa); + kvm_x86_ops->set_cr3(vcpu, vcpu->mmu.root_hpa); kvm_mmu_flush_tlb(vcpu); out: - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); return r; } EXPORT_SYMBOL_GPL(kvm_mmu_load); @@ -1097,7 +1120,7 @@ static void mmu_pte_write_zap_pte(struct struct kvm_mmu_page *child; pte = *spte; - if (is_present_pte(pte)) { + if (is_shadow_present_pte(pte)) { if (page->role.level == PT_PAGE_TABLE_LEVEL) rmap_remove(spte); else { @@ -1105,26 +1128,36 @@ static void mmu_pte_write_zap_pte(struct mmu_page_remove_parent_pte(child, spte); } } - *spte = 0; + set_shadow_pte(spte, shadow_trap_nonpresent_pte); kvm_flush_remote_tlbs(vcpu->kvm); } static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, u64 *spte, - const void *new, int bytes) + const void *new, int bytes, + int offset_in_pte) { if (page->role.level != PT_PAGE_TABLE_LEVEL) return; if (page->role.glevels == PT32_ROOT_LEVEL) - paging32_update_pte(vcpu, page, spte, new, bytes); + paging32_update_pte(vcpu, page, spte, new, bytes, + offset_in_pte); else - paging64_update_pte(vcpu, page, spte, new, bytes); + paging64_update_pte(vcpu, page, spte, new, bytes, + offset_in_pte); +} + +static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) +{ + u64 *spte = vcpu->last_pte_updated; + + return !!(spte && (*spte & PT_ACCESSED_MASK)); } void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *old, const u8 *new, int bytes) + const u8 *new, int bytes) { gfn_t gfn = gpa >> PAGE_SHIFT; struct kvm_mmu_page *page; @@ -1142,13 +1175,16 @@ void kvm_mmu_pte_write(struct kvm_vcpu * int npte; pgprintk("%s: gpa %llx bytes %d\n", __FUNCTION__, gpa, bytes); - if (gfn == vcpu->last_pt_write_gfn) { + kvm_mmu_audit(vcpu, "pre pte write"); + if (gfn == vcpu->last_pt_write_gfn + && !last_updated_pte_accessed(vcpu)) { ++vcpu->last_pt_write_count; if (vcpu->last_pt_write_count >= 3) flooded = 1; } else { vcpu->last_pt_write_gfn = gfn; vcpu->last_pt_write_count = 1; + vcpu->last_pte_updated = NULL; } index = kvm_page_table_hashfn(gfn) % KVM_NUM_MMU_PAGES; bucket = &vcpu->kvm->mmu_page_hash[index]; @@ -1197,10 +1233,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu * spte = &page->spt[page_offset / sizeof(*spte)]; while (npte--) { mmu_pte_write_zap_pte(vcpu, page, spte); - mmu_pte_write_new_pte(vcpu, page, spte, new, bytes); + mmu_pte_write_new_pte(vcpu, page, spte, new, bytes, + page_offset & (pte_size - 1)); ++spte; } } + kvm_mmu_audit(vcpu, "post pte write"); } int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) @@ -1375,22 +1413,33 @@ 
static void audit_mappings_page(struct k for (i = 0; i < PT64_ENT_PER_PAGE; ++i, va += va_delta) { u64 ent = pt[i]; - if (!(ent & PT_PRESENT_MASK)) + if (ent == shadow_trap_nonpresent_pte) continue; va = canonicalize(va); - if (level > 1) + if (level > 1) { + if (ent == shadow_notrap_nonpresent_pte) + printk(KERN_ERR "audit: (%s) nontrapping pte" + " in nonleaf level: levels %d gva %lx" + " level %d pte %llx\n", audit_msg, + vcpu->mmu.root_level, va, level, ent); + audit_mappings_page(vcpu, ent, va, level - 1); - else { + } else { gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, va); hpa_t hpa = gpa_to_hpa(vcpu, gpa); - if ((ent & PT_PRESENT_MASK) + if (is_shadow_present_pte(ent) && (ent & PT64_BASE_ADDR_MASK) != hpa) - printk(KERN_ERR "audit error: (%s) levels %d" - " gva %lx gpa %llx hpa %llx ent %llx\n", + printk(KERN_ERR "xx audit error: (%s) levels %d" + " gva %lx gpa %llx hpa %llx ent %llx %d\n", audit_msg, vcpu->mmu.root_level, - va, gpa, hpa, ent); + va, gpa, hpa, ent, is_shadow_present_pte(ent)); + else if (ent == shadow_notrap_nonpresent_pte + && !is_error_hpa(hpa)) + printk(KERN_ERR "audit: (%s) notrap shadow," + " valid guest gva %lx\n", audit_msg, va); + } } } diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h index 4b5391c..be0f852 100644 --- a/drivers/kvm/paging_tmpl.h +++ b/drivers/kvm/paging_tmpl.h @@ -31,6 +31,7 @@ #if PTTYPE == 64 #define PT_INDEX(addr, level) PT64_INDEX(addr, level) #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) #define PT_LEVEL_MASK(level) PT64_LEVEL_MASK(level) + #define PT_LEVEL_BITS PT64_LEVEL_BITS #ifdef CONFIG_X86_64 #define PT_MAX_FULL_LEVELS 4 #else @@ -45,6 +46,7 @@ #elif PTTYPE == 32 #define PT_INDEX(addr, level) PT32_INDEX(addr, level) #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level) + #define PT_LEVEL_BITS PT32_LEVEL_BITS #define PT_MAX_FULL_LEVELS 2 #else #error Invalid PTTYPE value @@ -58,7 +60,10 @@ struct guest_walker { int level; gfn_t table_gfn[PT_MAX_FULL_LEVELS]; pt_element_t *table; + pt_element_t pte; pt_element_t *ptep; + struct page *page; + int index; pt_element_t inherited_ar; gfn_t gfn; u32 error_code; @@ -80,11 +85,14 @@ static int FNAME(walk_addr)(struct guest pgprintk("%s: addr %lx\n", __FUNCTION__, addr); walker->level = vcpu->mmu.root_level; walker->table = NULL; + walker->page = NULL; + walker->ptep = NULL; root = vcpu->cr3; #if PTTYPE == 64 if (!is_long_mode(vcpu)) { walker->ptep = &vcpu->pdptrs[(addr >> 30) & 3]; root = *walker->ptep; + walker->pte = root; if (!(root & PT_PRESENT_MASK)) goto not_present; --walker->level; @@ -96,10 +104,11 @@ #endif walker->level - 1, table_gfn); slot = gfn_to_memslot(vcpu->kvm, table_gfn); hpa = safe_gpa_to_hpa(vcpu, root & PT64_BASE_ADDR_MASK); - walker->table = kmap_atomic(pfn_to_page(hpa >> PAGE_SHIFT), KM_USER0); + walker->page = pfn_to_page(hpa >> PAGE_SHIFT); + walker->table = kmap_atomic(walker->page, KM_USER0); ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) || - (vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) == 0); + (vcpu->cr3 & CR3_NONPAE_RESERVED_BITS) == 0); walker->inherited_ar = PT_USER_MASK | PT_WRITABLE_MASK; @@ -108,6 +117,7 @@ #endif hpa_t paddr; ptep = &walker->table[index]; + walker->index = index; ASSERT(((unsigned long)walker->table & PAGE_MASK) == ((unsigned long)ptep & PAGE_MASK)); @@ -148,16 +158,20 @@ #endif walker->inherited_ar &= walker->table[index]; table_gfn = (*ptep & PT_BASE_ADDR_MASK) >> PAGE_SHIFT; - paddr = safe_gpa_to_hpa(vcpu, *ptep & PT_BASE_ADDR_MASK); 
kunmap_atomic(walker->table, KM_USER0); - walker->table = kmap_atomic(pfn_to_page(paddr >> PAGE_SHIFT), - KM_USER0); + paddr = safe_gpa_to_hpa(vcpu, table_gfn << PAGE_SHIFT); + walker->page = pfn_to_page(paddr >> PAGE_SHIFT); + walker->table = kmap_atomic(walker->page, KM_USER0); --walker->level; walker->table_gfn[walker->level - 1 ] = table_gfn; pgprintk("%s: table_gfn[%d] %lx\n", __FUNCTION__, walker->level - 1, table_gfn); } - walker->ptep = ptep; + walker->pte = *ptep; + if (walker->page) + walker->ptep = NULL; + if (walker->table) + kunmap_atomic(walker->table, KM_USER0); pgprintk("%s: pte %llx\n", __FUNCTION__, (u64)*ptep); return 1; @@ -175,13 +189,9 @@ err: walker->error_code |= PFERR_USER_MASK; if (fetch_fault) walker->error_code |= PFERR_FETCH_MASK; - return 0; -} - -static void FNAME(release_walker)(struct guest_walker *walker) -{ if (walker->table) kunmap_atomic(walker->table, KM_USER0); + return 0; } static void FNAME(mark_pagetable_dirty)(struct kvm *kvm, @@ -193,7 +203,7 @@ static void FNAME(mark_pagetable_dirty)( static void FNAME(set_pte_common)(struct kvm_vcpu *vcpu, u64 *shadow_pte, gpa_t gaddr, - pt_element_t *gpte, + pt_element_t gpte, u64 access_bits, int user_fault, int write_fault, @@ -202,23 +212,39 @@ static void FNAME(set_pte_common)(struct gfn_t gfn) { hpa_t paddr; - int dirty = *gpte & PT_DIRTY_MASK; - u64 spte = *shadow_pte; - int was_rmapped = is_rmap_pte(spte); + int dirty = gpte & PT_DIRTY_MASK; + u64 spte; + int was_rmapped = is_rmap_pte(*shadow_pte); pgprintk("%s: spte %llx gpte %llx access %llx write_fault %d" " user_fault %d gfn %lx\n", - __FUNCTION__, spte, (u64)*gpte, access_bits, + __FUNCTION__, *shadow_pte, (u64)gpte, access_bits, write_fault, user_fault, gfn); if (write_fault && !dirty) { - *gpte |= PT_DIRTY_MASK; + pt_element_t *guest_ent, *tmp = NULL; + + if (walker->ptep) + guest_ent = walker->ptep; + else { + tmp = kmap_atomic(walker->page, KM_USER0); + guest_ent = &tmp[walker->index]; + } + + *guest_ent |= PT_DIRTY_MASK; + if (!walker->ptep) + kunmap_atomic(tmp, KM_USER0); dirty = 1; FNAME(mark_pagetable_dirty)(vcpu->kvm, walker); } - spte |= PT_PRESENT_MASK | PT_ACCESSED_MASK | PT_DIRTY_MASK; - spte |= *gpte & PT64_NX_MASK; + /* + * We don't set the accessed bit, since we sometimes want to see + * whether the guest actually used the pte (in order to detect + * demand paging). 
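+ * kvm_mmu_pte_write() reads the bit back through
+ * last_updated_pte_accessed(): an accessed pte means genuine guest
+ * use, so the write is not counted by the flooding detector.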
+ */ + spte = PT_PRESENT_MASK | PT_DIRTY_MASK; + spte |= gpte & PT64_NX_MASK; if (!dirty) access_bits &= ~PT_WRITABLE_MASK; @@ -229,10 +255,8 @@ static void FNAME(set_pte_common)(struct spte |= PT_USER_MASK; if (is_error_hpa(paddr)) { - spte |= gaddr; - spte |= PT_SHADOW_IO_MARK; - spte &= ~PT_PRESENT_MASK; - set_shadow_pte(shadow_pte, spte); + set_shadow_pte(shadow_pte, + shadow_trap_nonpresent_pte | PT_SHADOW_IO_MARK); return; } @@ -255,7 +279,7 @@ static void FNAME(set_pte_common)(struct access_bits &= ~PT_WRITABLE_MASK; if (is_writeble_pte(spte)) { spte &= ~PT_WRITABLE_MASK; - kvm_arch_ops->tlb_flush(vcpu); + kvm_x86_ops->tlb_flush(vcpu); } if (write_fault) *ptwrite = 1; @@ -267,50 +291,57 @@ unshadowed: if (access_bits & PT_WRITABLE_MASK) mark_page_dirty(vcpu->kvm, gaddr >> PAGE_SHIFT); + pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte); set_shadow_pte(shadow_pte, spte); page_header_update_slot(vcpu->kvm, shadow_pte, gaddr); if (!was_rmapped) rmap_add(vcpu, shadow_pte); + if (!ptwrite || !*ptwrite) + vcpu->last_pte_updated = shadow_pte; } -static void FNAME(set_pte)(struct kvm_vcpu *vcpu, pt_element_t *gpte, +static void FNAME(set_pte)(struct kvm_vcpu *vcpu, pt_element_t gpte, u64 *shadow_pte, u64 access_bits, int user_fault, int write_fault, int *ptwrite, struct guest_walker *walker, gfn_t gfn) { - access_bits &= *gpte; - FNAME(set_pte_common)(vcpu, shadow_pte, *gpte & PT_BASE_ADDR_MASK, + access_bits &= gpte; + FNAME(set_pte_common)(vcpu, shadow_pte, gpte & PT_BASE_ADDR_MASK, gpte, access_bits, user_fault, write_fault, ptwrite, walker, gfn); } static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, - u64 *spte, const void *pte, int bytes) + u64 *spte, const void *pte, int bytes, + int offset_in_pte) { pt_element_t gpte; - if (bytes < sizeof(pt_element_t)) - return; gpte = *(const pt_element_t *)pte; - if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) + if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) { + if (!offset_in_pte && !is_present_pte(gpte)) + set_shadow_pte(spte, shadow_notrap_nonpresent_pte); + return; + } + if (bytes < sizeof(pt_element_t)) return; pgprintk("%s: gpte %llx spte %p\n", __FUNCTION__, (u64)gpte, spte); - FNAME(set_pte)(vcpu, &gpte, spte, PT_USER_MASK | PT_WRITABLE_MASK, 0, + FNAME(set_pte)(vcpu, gpte, spte, PT_USER_MASK | PT_WRITABLE_MASK, 0, 0, NULL, NULL, (gpte & PT_BASE_ADDR_MASK) >> PAGE_SHIFT); } -static void FNAME(set_pde)(struct kvm_vcpu *vcpu, pt_element_t *gpde, +static void FNAME(set_pde)(struct kvm_vcpu *vcpu, pt_element_t gpde, u64 *shadow_pte, u64 access_bits, int user_fault, int write_fault, int *ptwrite, struct guest_walker *walker, gfn_t gfn) { gpa_t gaddr; - access_bits &= *gpde; + access_bits &= gpde; gaddr = (gpa_t)gfn << PAGE_SHIFT; if (PTTYPE == 32 && is_cpuid_PSE36()) - gaddr |= (*gpde & PT32_DIR_PSE36_MASK) << + gaddr |= (gpde & PT32_DIR_PSE36_MASK) << (32 - PT32_DIR_PSE36_SHIFT); FNAME(set_pte_common)(vcpu, shadow_pte, gaddr, gpde, access_bits, user_fault, write_fault, @@ -328,9 +359,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu int level; u64 *shadow_ent; u64 *prev_shadow_ent = NULL; - pt_element_t *guest_ent = walker->ptep; - if (!is_present_pte(*guest_ent)) + if (!is_present_pte(walker->pte)) return NULL; shadow_addr = vcpu->mmu.root_hpa; @@ -350,7 +380,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu unsigned hugepage_access = 0; shadow_ent = ((u64 *)__va(shadow_addr)) + index; - if (is_present_pte(*shadow_ent) || is_io_pte(*shadow_ent)) { + if (is_shadow_present_pte(*shadow_ent)) { if (level == 
PT_PAGE_TABLE_LEVEL) break; shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK; @@ -364,12 +394,12 @@ static u64 *FNAME(fetch)(struct kvm_vcpu if (level - 1 == PT_PAGE_TABLE_LEVEL && walker->level == PT_DIRECTORY_LEVEL) { metaphysical = 1; - hugepage_access = *guest_ent; + hugepage_access = walker->pte; hugepage_access &= PT_USER_MASK | PT_WRITABLE_MASK; - if (*guest_ent & PT64_NX_MASK) + if (walker->pte & PT64_NX_MASK) hugepage_access |= (1 << 2); hugepage_access >>= PT_WRITABLE_SHIFT; - table_gfn = (*guest_ent & PT_BASE_ADDR_MASK) + table_gfn = (walker->pte & PT_BASE_ADDR_MASK) >> PAGE_SHIFT; } else { metaphysical = 0; @@ -386,12 +416,12 @@ static u64 *FNAME(fetch)(struct kvm_vcpu } if (walker->level == PT_DIRECTORY_LEVEL) { - FNAME(set_pde)(vcpu, guest_ent, shadow_ent, + FNAME(set_pde)(vcpu, walker->pte, shadow_ent, walker->inherited_ar, user_fault, write_fault, ptwrite, walker, walker->gfn); } else { ASSERT(walker->level == PT_PAGE_TABLE_LEVEL); - FNAME(set_pte)(vcpu, guest_ent, shadow_ent, + FNAME(set_pte)(vcpu, walker->pte, shadow_ent, walker->inherited_ar, user_fault, write_fault, ptwrite, walker, walker->gfn); } @@ -442,7 +472,6 @@ static int FNAME(page_fault)(struct kvm_ if (!r) { pgprintk("%s: guest page fault\n", __FUNCTION__); inject_page_fault(vcpu, addr, walker.error_code); - FNAME(release_walker)(&walker); vcpu->last_pt_write_count = 0; /* reset fork detector */ return 0; } @@ -452,8 +481,6 @@ static int FNAME(page_fault)(struct kvm_ pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __FUNCTION__, shadow_pte, *shadow_pte, write_pt); - FNAME(release_walker)(&walker); - if (!write_pt) vcpu->last_pt_write_count = 0; /* reset fork detector */ @@ -482,10 +509,29 @@ static gpa_t FNAME(gva_to_gpa)(struct kv gpa |= vaddr & ~PAGE_MASK; } - FNAME(release_walker)(&walker); return gpa; } +static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp) +{ + int i; + pt_element_t *gpt; + + if (sp->role.metaphysical || PTTYPE == 32) { + nonpaging_prefetch_page(vcpu, sp); + return; + } + + gpt = kmap_atomic(gfn_to_page(vcpu->kvm, sp->gfn), KM_USER0); + for (i = 0; i < PT64_ENT_PER_PAGE; ++i) + if (is_present_pte(gpt[i])) + sp->spt[i] = shadow_trap_nonpresent_pte; + else + sp->spt[i] = shadow_notrap_nonpresent_pte; + kunmap_atomic(gpt, KM_USER0); +} + #undef pt_element_t #undef guest_walker #undef FNAME @@ -494,4 +540,5 @@ #undef PT_INDEX #undef SHADOW_PT_INDEX #undef PT_LEVEL_MASK #undef PT_DIR_BASE_ADDR_MASK +#undef PT_LEVEL_BITS #undef PT_MAX_FULL_LEVELS diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index bc818cc..a533088 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -16,12 +16,12 @@ #include "kvm_svm.h" #include "x86_emulate.h" +#include "irq.h" #include #include #include #include -#include #include #include @@ -38,7 +38,6 @@ #define GP_VECTOR 13 #define DR7_GD_MASK (1 << 13) #define DR6_BD_MASK (1 << 13) -#define CR4_DE_MASK (1UL << 3) #define SEG_TYPE_LDT 2 #define SEG_TYPE_BUSY_TSS16 3 @@ -50,6 +49,13 @@ #define SVM_FEATURE_NPT (1 << 0) #define SVM_FEATURE_LBRV (1 << 1) #define SVM_DEATURE_SVML (1 << 2) +static void kvm_reput_irq(struct vcpu_svm *svm); + +static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) +{ + return container_of(vcpu, struct vcpu_svm, vcpu); +} + unsigned long iopm_base; unsigned long msrpm_base; @@ -94,20 +100,6 @@ static inline u32 svm_has(u32 feat) return svm_features & feat; } -static unsigned get_addr_size(struct kvm_vcpu *vcpu) -{ - struct vmcb_save_area *sa = &vcpu->svm->vmcb->save; - u16 cs_attrib; - - if (!(sa->cr0 & 
CR0_PE_MASK) || (sa->rflags & X86_EFLAGS_VM)) - return 2; - - cs_attrib = sa->cs.attrib; - - return (cs_attrib & SVM_SELECTOR_L_MASK) ? 8 : - (cs_attrib & SVM_SELECTOR_DB_MASK) ? 4 : 2; -} - static inline u8 pop_irq(struct kvm_vcpu *vcpu) { int word_index = __ffs(vcpu->irq_summary); @@ -182,7 +174,7 @@ static inline void write_dr7(unsigned lo static inline void force_new_asid(struct kvm_vcpu *vcpu) { - vcpu->svm->asid_generation--; + to_svm(vcpu)->asid_generation--; } static inline void flush_guest_tlb(struct kvm_vcpu *vcpu) @@ -195,22 +187,24 @@ static void svm_set_efer(struct kvm_vcpu if (!(efer & KVM_EFER_LMA)) efer &= ~KVM_EFER_LME; - vcpu->svm->vmcb->save.efer = efer | MSR_EFER_SVME_MASK; + to_svm(vcpu)->vmcb->save.efer = efer | MSR_EFER_SVME_MASK; vcpu->shadow_efer = efer; } static void svm_inject_gp(struct kvm_vcpu *vcpu, unsigned error_code) { - vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_VALID_ERR | SVM_EVTINJ_TYPE_EXEPT | GP_VECTOR; - vcpu->svm->vmcb->control.event_inj_err = error_code; + svm->vmcb->control.event_inj_err = error_code; } static void inject_ud(struct kvm_vcpu *vcpu) { - vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + to_svm(vcpu)->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT | UD_VECTOR; } @@ -229,19 +223,21 @@ static int is_external_interrupt(u32 inf static void skip_emulated_instruction(struct kvm_vcpu *vcpu) { - if (!vcpu->svm->next_rip) { + struct vcpu_svm *svm = to_svm(vcpu); + + if (!svm->next_rip) { printk(KERN_DEBUG "%s: NOP\n", __FUNCTION__); return; } - if (vcpu->svm->next_rip - vcpu->svm->vmcb->save.rip > 15) { + if (svm->next_rip - svm->vmcb->save.rip > MAX_INST_SIZE) { printk(KERN_ERR "%s: ip 0x%llx next 0x%llx\n", __FUNCTION__, - vcpu->svm->vmcb->save.rip, - vcpu->svm->next_rip); + svm->vmcb->save.rip, + svm->next_rip); } - vcpu->rip = vcpu->svm->vmcb->save.rip = vcpu->svm->next_rip; - vcpu->svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK; + vcpu->rip = svm->vmcb->save.rip = svm->next_rip; + svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK; vcpu->interrupt_window_open = 1; } @@ -351,8 +347,8 @@ err_1: } -static int set_msr_interception(u32 *msrpm, unsigned msr, - int read, int write) +static void set_msr_interception(u32 *msrpm, unsigned msr, + int read, int write) { int i; @@ -367,11 +363,10 @@ static int set_msr_interception(u32 *msr u32 mask = ((write) ? 0 : 2) | ((read) ? 
0 : 1); *base = (*base & ~(0x3 << msr_shift)) | (mask << msr_shift); - return 1; + return; } } - printk(KERN_DEBUG "%s: not found 0x%x\n", __FUNCTION__, msr); - return 0; + BUG(); } static __init int svm_hardware_setup(void) @@ -382,8 +377,6 @@ static __init int svm_hardware_setup(voi void *iopm_va, *msrpm_va; int r; - kvm_emulator_want_group7_invlpg(); - iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER); if (!iopm_pages) @@ -458,11 +451,6 @@ static void init_sys_seg(struct vmcb_seg seg->base = 0; } -static int svm_vcpu_setup(struct kvm_vcpu *vcpu) -{ - return 0; -} - static void init_vmcb(struct vmcb *vmcb) { struct vmcb_control_area *control = &vmcb->control; @@ -488,7 +476,8 @@ static void init_vmcb(struct vmcb *vmcb) INTERCEPT_DR5_MASK | INTERCEPT_DR7_MASK; - control->intercept_exceptions = 1 << PF_VECTOR; + control->intercept_exceptions = (1 << PF_VECTOR) | + (1 << UD_VECTOR); control->intercept = (1ULL << INTERCEPT_INTR) | @@ -563,59 +552,83 @@ static void init_vmcb(struct vmcb *vmcb) * cr0 val on cpu init should be 0x60000010, we enable cpu * cache by default. the orderly way is to enable cache in bios. */ - save->cr0 = 0x00000010 | CR0_PG_MASK | CR0_WP_MASK; - save->cr4 = CR4_PAE_MASK; + save->cr0 = 0x00000010 | X86_CR0_PG | X86_CR0_WP; + save->cr4 = X86_CR4_PAE; /* rdx = ?? */ } -static int svm_create_vcpu(struct kvm_vcpu *vcpu) +static void svm_vcpu_reset(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + init_vmcb(svm->vmcb); +} + +static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) { + struct vcpu_svm *svm; struct page *page; - int r; + int err; + + svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); + if (!svm) { + err = -ENOMEM; + goto out; + } + + err = kvm_vcpu_init(&svm->vcpu, kvm, id); + if (err) + goto free_svm; + + if (irqchip_in_kernel(kvm)) { + err = kvm_create_lapic(&svm->vcpu); + if (err < 0) + goto free_svm; + } - r = -ENOMEM; - vcpu->svm = kzalloc(sizeof *vcpu->svm, GFP_KERNEL); - if (!vcpu->svm) - goto out1; page = alloc_page(GFP_KERNEL); - if (!page) - goto out2; - - vcpu->svm->vmcb = page_address(page); - clear_page(vcpu->svm->vmcb); - vcpu->svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT; - vcpu->svm->asid_generation = 0; - memset(vcpu->svm->db_regs, 0, sizeof(vcpu->svm->db_regs)); - init_vmcb(vcpu->svm->vmcb); - - fx_init(vcpu); - vcpu->fpu_active = 1; - vcpu->apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; - if (vcpu == &vcpu->kvm->vcpus[0]) - vcpu->apic_base |= MSR_IA32_APICBASE_BSP; + if (!page) { + err = -ENOMEM; + goto uninit; + } - return 0; + svm->vmcb = page_address(page); + clear_page(svm->vmcb); + svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT; + svm->asid_generation = 0; + memset(svm->db_regs, 0, sizeof(svm->db_regs)); + init_vmcb(svm->vmcb); -out2: - kfree(vcpu->svm); -out1: - return r; + fx_init(&svm->vcpu); + svm->vcpu.fpu_active = 1; + svm->vcpu.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; + if (svm->vcpu.vcpu_id == 0) + svm->vcpu.apic_base |= MSR_IA32_APICBASE_BSP; + + return &svm->vcpu; + +uninit: + kvm_vcpu_uninit(&svm->vcpu); +free_svm: + kmem_cache_free(kvm_vcpu_cache, svm); +out: + return ERR_PTR(err); } static void svm_free_vcpu(struct kvm_vcpu *vcpu) { - if (!vcpu->svm) - return; - if (vcpu->svm->vmcb) - __free_page(pfn_to_page(vcpu->svm->vmcb_pa >> PAGE_SHIFT)); - kfree(vcpu->svm); + struct vcpu_svm *svm = to_svm(vcpu); + + __free_page(pfn_to_page(svm->vmcb_pa >> PAGE_SHIFT)); + kvm_vcpu_uninit(vcpu); + kmem_cache_free(kvm_vcpu_cache, svm); } -static void svm_vcpu_load(struct 
kvm_vcpu *vcpu) +static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { - int cpu, i; + struct vcpu_svm *svm = to_svm(vcpu); + int i; - cpu = get_cpu(); if (unlikely(cpu != vcpu->cpu)) { u64 tsc_this, delta; @@ -625,23 +638,24 @@ static void svm_vcpu_load(struct kvm_vcp */ rdtscll(tsc_this); delta = vcpu->host_tsc - tsc_this; - vcpu->svm->vmcb->control.tsc_offset += delta; + svm->vmcb->control.tsc_offset += delta; vcpu->cpu = cpu; + kvm_migrate_apic_timer(vcpu); } for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) - rdmsrl(host_save_user_msrs[i], vcpu->svm->host_user_msrs[i]); + rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); } static void svm_vcpu_put(struct kvm_vcpu *vcpu) { + struct vcpu_svm *svm = to_svm(vcpu); int i; for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) - wrmsrl(host_save_user_msrs[i], vcpu->svm->host_user_msrs[i]); + wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); rdtscll(vcpu->host_tsc); - put_cpu(); } static void svm_vcpu_decache(struct kvm_vcpu *vcpu) @@ -650,31 +664,34 @@ static void svm_vcpu_decache(struct kvm_ static void svm_cache_regs(struct kvm_vcpu *vcpu) { - vcpu->regs[VCPU_REGS_RAX] = vcpu->svm->vmcb->save.rax; - vcpu->regs[VCPU_REGS_RSP] = vcpu->svm->vmcb->save.rsp; - vcpu->rip = vcpu->svm->vmcb->save.rip; + struct vcpu_svm *svm = to_svm(vcpu); + + vcpu->regs[VCPU_REGS_RAX] = svm->vmcb->save.rax; + vcpu->regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp; + vcpu->rip = svm->vmcb->save.rip; } static void svm_decache_regs(struct kvm_vcpu *vcpu) { - vcpu->svm->vmcb->save.rax = vcpu->regs[VCPU_REGS_RAX]; - vcpu->svm->vmcb->save.rsp = vcpu->regs[VCPU_REGS_RSP]; - vcpu->svm->vmcb->save.rip = vcpu->rip; + struct vcpu_svm *svm = to_svm(vcpu); + svm->vmcb->save.rax = vcpu->regs[VCPU_REGS_RAX]; + svm->vmcb->save.rsp = vcpu->regs[VCPU_REGS_RSP]; + svm->vmcb->save.rip = vcpu->rip; } static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) { - return vcpu->svm->vmcb->save.rflags; + return to_svm(vcpu)->vmcb->save.rflags; } static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) { - vcpu->svm->vmcb->save.rflags = rflags; + to_svm(vcpu)->vmcb->save.rflags = rflags; } static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg) { - struct vmcb_save_area *save = &vcpu->svm->vmcb->save; + struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save; switch (seg) { case VCPU_SREG_CS: return &save->cs; @@ -716,36 +733,36 @@ static void svm_get_segment(struct kvm_v var->unusable = !var->present; } -static void svm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) -{ - struct vmcb_seg *s = svm_seg(vcpu, VCPU_SREG_CS); - - *db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1; - *l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1; -} - static void svm_get_idt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) { - dt->limit = vcpu->svm->vmcb->save.idtr.limit; - dt->base = vcpu->svm->vmcb->save.idtr.base; + struct vcpu_svm *svm = to_svm(vcpu); + + dt->limit = svm->vmcb->save.idtr.limit; + dt->base = svm->vmcb->save.idtr.base; } static void svm_set_idt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) { - vcpu->svm->vmcb->save.idtr.limit = dt->limit; - vcpu->svm->vmcb->save.idtr.base = dt->base ; + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.idtr.limit = dt->limit; + svm->vmcb->save.idtr.base = dt->base ; } static void svm_get_gdt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) { - dt->limit = vcpu->svm->vmcb->save.gdtr.limit; - dt->base = vcpu->svm->vmcb->save.gdtr.base; + struct vcpu_svm *svm = to_svm(vcpu); + + dt->limit = svm->vmcb->save.gdtr.limit; 
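/*
 * (The conversion running through this hunk and the rest of svm.c:
 * vcpu->svm->X becomes to_svm(vcpu)->X.  struct vcpu_svm now embeds its
 * struct kvm_vcpu, and to_svm() is plain container_of() -- pointer
 * arithmetic of roughly this shape, minus the kernel's typeof check:
 *
 *   #define container_of(ptr, type, member) \
 *           ((type *)((char *)(ptr) - offsetof(type, member)))
 *
 * which is why the separately-allocated vcpu->svm and its kzalloc/kfree
 * pair drop out of svm_create_vcpu()/svm_free_vcpu() above.)
 */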
+ dt->base = svm->vmcb->save.gdtr.base; } static void svm_set_gdt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) { - vcpu->svm->vmcb->save.gdtr.limit = dt->limit; - vcpu->svm->vmcb->save.gdtr.base = dt->base ; + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.gdtr.limit = dt->limit; + svm->vmcb->save.gdtr.base = dt->base ; } static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu) @@ -754,39 +771,42 @@ static void svm_decache_cr4_guest_bits(s static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) { + struct vcpu_svm *svm = to_svm(vcpu); + #ifdef CONFIG_X86_64 if (vcpu->shadow_efer & KVM_EFER_LME) { - if (!is_paging(vcpu) && (cr0 & CR0_PG_MASK)) { + if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) { vcpu->shadow_efer |= KVM_EFER_LMA; - vcpu->svm->vmcb->save.efer |= KVM_EFER_LMA | KVM_EFER_LME; + svm->vmcb->save.efer |= KVM_EFER_LMA | KVM_EFER_LME; } - if (is_paging(vcpu) && !(cr0 & CR0_PG_MASK) ) { + if (is_paging(vcpu) && !(cr0 & X86_CR0_PG) ) { vcpu->shadow_efer &= ~KVM_EFER_LMA; - vcpu->svm->vmcb->save.efer &= ~(KVM_EFER_LMA | KVM_EFER_LME); + svm->vmcb->save.efer &= ~(KVM_EFER_LMA | KVM_EFER_LME); } } #endif - if ((vcpu->cr0 & CR0_TS_MASK) && !(cr0 & CR0_TS_MASK)) { - vcpu->svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); + if ((vcpu->cr0 & X86_CR0_TS) && !(cr0 & X86_CR0_TS)) { + svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); vcpu->fpu_active = 1; } vcpu->cr0 = cr0; - cr0 |= CR0_PG_MASK | CR0_WP_MASK; - cr0 &= ~(CR0_CD_MASK | CR0_NW_MASK); - vcpu->svm->vmcb->save.cr0 = cr0; + cr0 |= X86_CR0_PG | X86_CR0_WP; + cr0 &= ~(X86_CR0_CD | X86_CR0_NW); + svm->vmcb->save.cr0 = cr0; } static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { vcpu->cr4 = cr4; - vcpu->svm->vmcb->save.cr4 = cr4 | CR4_PAE_MASK; + to_svm(vcpu)->vmcb->save.cr4 = cr4 | X86_CR4_PAE; } static void svm_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) { + struct vcpu_svm *svm = to_svm(vcpu); struct vmcb_seg *s = svm_seg(vcpu, seg); s->base = var->base; @@ -805,16 +825,16 @@ static void svm_set_segment(struct kvm_v s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT; } if (seg == VCPU_SREG_CS) - vcpu->svm->vmcb->save.cpl - = (vcpu->svm->vmcb->save.cs.attrib + svm->vmcb->save.cpl + = (svm->vmcb->save.cs.attrib >> SVM_SELECTOR_DPL_SHIFT) & 3; } /* FIXME: - vcpu->svm->vmcb->control.int_ctl &= ~V_TPR_MASK; - vcpu->svm->vmcb->control.int_ctl |= (sregs->cr8 & V_TPR_MASK); + svm(vcpu)->vmcb->control.int_ctl &= ~V_TPR_MASK; + svm(vcpu)->vmcb->control.int_ctl |= (sregs->cr8 & V_TPR_MASK); */ @@ -823,61 +843,68 @@ static int svm_guest_debug(struct kvm_vc return -EOPNOTSUPP; } +static int svm_get_irq(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u32 exit_int_info = svm->vmcb->control.exit_int_info; + + if (is_external_interrupt(exit_int_info)) + return exit_int_info & SVM_EVTINJ_VEC_MASK; + return -1; +} + static void load_host_msrs(struct kvm_vcpu *vcpu) { #ifdef CONFIG_X86_64 - wrmsrl(MSR_GS_BASE, vcpu->svm->host_gs_base); + wrmsrl(MSR_GS_BASE, to_svm(vcpu)->host_gs_base); #endif } static void save_host_msrs(struct kvm_vcpu *vcpu) { #ifdef CONFIG_X86_64 - rdmsrl(MSR_GS_BASE, vcpu->svm->host_gs_base); + rdmsrl(MSR_GS_BASE, to_svm(vcpu)->host_gs_base); #endif } -static void new_asid(struct kvm_vcpu *vcpu, struct svm_cpu_data *svm_data) +static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *svm_data) { if (svm_data->next_asid > svm_data->max_asid) { ++svm_data->asid_generation; svm_data->next_asid = 1; - 
vcpu->svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID; + svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID; } - vcpu->cpu = svm_data->cpu; - vcpu->svm->asid_generation = svm_data->asid_generation; - vcpu->svm->vmcb->control.asid = svm_data->next_asid++; -} - -static void svm_invlpg(struct kvm_vcpu *vcpu, gva_t address) -{ - invlpga(address, vcpu->svm->vmcb->control.asid); // is needed? + svm->vcpu.cpu = svm_data->cpu; + svm->asid_generation = svm_data->asid_generation; + svm->vmcb->control.asid = svm_data->next_asid++; } static unsigned long svm_get_dr(struct kvm_vcpu *vcpu, int dr) { - return vcpu->svm->db_regs[dr]; + return to_svm(vcpu)->db_regs[dr]; } static void svm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long value, int *exception) { + struct vcpu_svm *svm = to_svm(vcpu); + *exception = 0; - if (vcpu->svm->vmcb->save.dr7 & DR7_GD_MASK) { - vcpu->svm->vmcb->save.dr7 &= ~DR7_GD_MASK; - vcpu->svm->vmcb->save.dr6 |= DR6_BD_MASK; + if (svm->vmcb->save.dr7 & DR7_GD_MASK) { + svm->vmcb->save.dr7 &= ~DR7_GD_MASK; + svm->vmcb->save.dr6 |= DR6_BD_MASK; *exception = DB_VECTOR; return; } switch (dr) { case 0 ... 3: - vcpu->svm->db_regs[dr] = value; + svm->db_regs[dr] = value; return; case 4 ... 5: - if (vcpu->cr4 & CR4_DE_MASK) { + if (vcpu->cr4 & X86_CR4_DE) { *exception = UD_VECTOR; return; } @@ -886,7 +913,7 @@ static void svm_set_dr(struct kvm_vcpu * *exception = GP_VECTOR; return; } - vcpu->svm->vmcb->save.dr7 = value; + svm->vmcb->save.dr7 = value; return; } default: @@ -897,42 +924,44 @@ static void svm_set_dr(struct kvm_vcpu * } } -static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int pf_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - u32 exit_int_info = vcpu->svm->vmcb->control.exit_int_info; + u32 exit_int_info = svm->vmcb->control.exit_int_info; + struct kvm *kvm = svm->vcpu.kvm; u64 fault_address; u32 error_code; enum emulation_result er; int r; - if (is_external_interrupt(exit_int_info)) - push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK); + if (!irqchip_in_kernel(kvm) && + is_external_interrupt(exit_int_info)) + push_irq(&svm->vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK); - spin_lock(&vcpu->kvm->lock); + mutex_lock(&kvm->lock); - fault_address = vcpu->svm->vmcb->control.exit_info_2; - error_code = vcpu->svm->vmcb->control.exit_info_1; - r = kvm_mmu_page_fault(vcpu, fault_address, error_code); + fault_address = svm->vmcb->control.exit_info_2; + error_code = svm->vmcb->control.exit_info_1; + r = kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code); if (r < 0) { - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&kvm->lock); return r; } if (!r) { - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&kvm->lock); return 1; } - er = emulate_instruction(vcpu, kvm_run, fault_address, error_code); - spin_unlock(&vcpu->kvm->lock); + er = emulate_instruction(&svm->vcpu, kvm_run, fault_address, + error_code, 0); + mutex_unlock(&kvm->lock); switch (er) { case EMULATE_DONE: return 1; case EMULATE_DO_MMIO: - ++vcpu->stat.mmio_exits; - kvm_run->exit_reason = KVM_EXIT_MMIO; + ++svm->vcpu.stat.mmio_exits; return 0; case EMULATE_FAIL: - vcpu_printf(vcpu, "%s: emulate fail\n", __FUNCTION__); + kvm_report_emulation_failure(&svm->vcpu, "pagetable"); break; default: BUG(); @@ -942,252 +971,155 @@ static int pf_interception(struct kvm_vc return 0; } -static int nm_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + int er; + + er = 
emulate_instruction(&svm->vcpu, kvm_run, 0, 0, 0); + if (er != EMULATE_DONE) + inject_ud(&svm->vcpu); + + return 1; +} + +static int nm_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - vcpu->svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); - if (!(vcpu->cr0 & CR0_TS_MASK)) - vcpu->svm->vmcb->save.cr0 &= ~CR0_TS_MASK; - vcpu->fpu_active = 1; + svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); + if (!(svm->vcpu.cr0 & X86_CR0_TS)) + svm->vmcb->save.cr0 &= ~X86_CR0_TS; + svm->vcpu.fpu_active = 1; - return 1; + return 1; } -static int shutdown_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int shutdown_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { /* * VMCB is undefined after a SHUTDOWN intercept * so reinitialize it. */ - clear_page(vcpu->svm->vmcb); - init_vmcb(vcpu->svm->vmcb); + clear_page(svm->vmcb); + init_vmcb(svm->vmcb); kvm_run->exit_reason = KVM_EXIT_SHUTDOWN; return 0; } -static int io_get_override(struct kvm_vcpu *vcpu, - struct vmcb_seg **seg, - int *addr_override) -{ - u8 inst[MAX_INST_SIZE]; - unsigned ins_length; - gva_t rip; - int i; - - rip = vcpu->svm->vmcb->save.rip; - ins_length = vcpu->svm->next_rip - rip; - rip += vcpu->svm->vmcb->save.cs.base; - - if (ins_length > MAX_INST_SIZE) - printk(KERN_DEBUG - "%s: inst length err, cs base 0x%llx rip 0x%llx " - "next rip 0x%llx ins_length %u\n", - __FUNCTION__, - vcpu->svm->vmcb->save.cs.base, - vcpu->svm->vmcb->save.rip, - vcpu->svm->vmcb->control.exit_info_2, - ins_length); - - if (kvm_read_guest(vcpu, rip, ins_length, inst) != ins_length) - /* #PF */ - return 0; - - *addr_override = 0; - *seg = NULL; - for (i = 0; i < ins_length; i++) - switch (inst[i]) { - case 0xf0: - case 0xf2: - case 0xf3: - case 0x66: - continue; - case 0x67: - *addr_override = 1; - continue; - case 0x2e: - *seg = &vcpu->svm->vmcb->save.cs; - continue; - case 0x36: - *seg = &vcpu->svm->vmcb->save.ss; - continue; - case 0x3e: - *seg = &vcpu->svm->vmcb->save.ds; - continue; - case 0x26: - *seg = &vcpu->svm->vmcb->save.es; - continue; - case 0x64: - *seg = &vcpu->svm->vmcb->save.fs; - continue; - case 0x65: - *seg = &vcpu->svm->vmcb->save.gs; - continue; - default: - return 1; - } - printk(KERN_DEBUG "%s: unexpected\n", __FUNCTION__); - return 0; -} - -static unsigned long io_adress(struct kvm_vcpu *vcpu, int ins, gva_t *address) +static int io_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - unsigned long addr_mask; - unsigned long *reg; - struct vmcb_seg *seg; - int addr_override; - struct vmcb_save_area *save_area = &vcpu->svm->vmcb->save; - u16 cs_attrib = save_area->cs.attrib; - unsigned addr_size = get_addr_size(vcpu); - - if (!io_get_override(vcpu, &seg, &addr_override)) - return 0; - - if (addr_override) - addr_size = (addr_size == 2) ? 4: (addr_size >> 1); + u32 io_info = svm->vmcb->control.exit_info_1; //address size bug? + int size, down, in, string, rep; + unsigned port; - if (ins) { - reg = &vcpu->regs[VCPU_REGS_RDI]; - seg = &vcpu->svm->vmcb->save.es; - } else { - reg = &vcpu->regs[VCPU_REGS_RSI]; - seg = (seg) ? 
seg : &vcpu->svm->vmcb->save.ds; - } + ++svm->vcpu.stat.io_exits; - addr_mask = ~0ULL >> (64 - (addr_size * 8)); + svm->next_rip = svm->vmcb->control.exit_info_2; - if ((cs_attrib & SVM_SELECTOR_L_MASK) && - !(vcpu->svm->vmcb->save.rflags & X86_EFLAGS_VM)) { - *address = (*reg & addr_mask); - return addr_mask; - } + string = (io_info & SVM_IOIO_STR_MASK) != 0; - if (!(seg->attrib & SVM_SELECTOR_P_SHIFT)) { - svm_inject_gp(vcpu, 0); - return 0; + if (string) { + if (emulate_instruction(&svm->vcpu, + kvm_run, 0, 0, 0) == EMULATE_DO_MMIO) + return 0; + return 1; } - *address = (*reg & addr_mask) + seg->base; - return addr_mask; -} - -static int io_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) -{ - u32 io_info = vcpu->svm->vmcb->control.exit_info_1; //address size bug? - int size, down, in, string, rep; - unsigned port; - unsigned long count; - gva_t address = 0; - - ++vcpu->stat.io_exits; - - vcpu->svm->next_rip = vcpu->svm->vmcb->control.exit_info_2; - in = (io_info & SVM_IOIO_TYPE_MASK) != 0; port = io_info >> 16; size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT; - string = (io_info & SVM_IOIO_STR_MASK) != 0; rep = (io_info & SVM_IOIO_REP_MASK) != 0; - count = 1; - down = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_DF) != 0; - - if (string) { - unsigned addr_mask; - - addr_mask = io_adress(vcpu, in, &address); - if (!addr_mask) { - printk(KERN_DEBUG "%s: get io address failed\n", - __FUNCTION__); - return 1; - } + down = (svm->vmcb->save.rflags & X86_EFLAGS_DF) != 0; - if (rep) - count = vcpu->regs[VCPU_REGS_RCX] & addr_mask; - } - return kvm_setup_pio(vcpu, kvm_run, in, size, count, string, down, - address, rep, port); + return kvm_emulate_pio(&svm->vcpu, kvm_run, in, size, port); } -static int nop_on_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int nop_on_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { return 1; } -static int halt_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int halt_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 1; - skip_emulated_instruction(vcpu); - return kvm_emulate_halt(vcpu); + svm->next_rip = svm->vmcb->save.rip + 1; + skip_emulated_instruction(&svm->vcpu); + return kvm_emulate_halt(&svm->vcpu); } -static int vmmcall_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int vmmcall_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 3; - skip_emulated_instruction(vcpu); - return kvm_hypercall(vcpu, kvm_run); + svm->next_rip = svm->vmcb->save.rip + 3; + skip_emulated_instruction(&svm->vcpu); + kvm_emulate_hypercall(&svm->vcpu); + return 1; } -static int invalid_op_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int invalid_op_interception(struct vcpu_svm *svm, + struct kvm_run *kvm_run) { - inject_ud(vcpu); + inject_ud(&svm->vcpu); return 1; } -static int task_switch_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int task_switch_interception(struct vcpu_svm *svm, + struct kvm_run *kvm_run) { - printk(KERN_DEBUG "%s: task swiche is unsupported\n", __FUNCTION__); + pr_unimpl(&svm->vcpu, "%s: task switch is unsupported\n", __FUNCTION__); kvm_run->exit_reason = KVM_EXIT_UNKNOWN; return 0; } -static int cpuid_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int cpuid_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 
2; - kvm_emulate_cpuid(vcpu); + svm->next_rip = svm->vmcb->save.rip + 2; + kvm_emulate_cpuid(&svm->vcpu); return 1; } -static int emulate_on_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int emulate_on_interception(struct vcpu_svm *svm, + struct kvm_run *kvm_run) { - if (emulate_instruction(vcpu, NULL, 0, 0) != EMULATE_DONE) - printk(KERN_ERR "%s: failed\n", __FUNCTION__); + if (emulate_instruction(&svm->vcpu, NULL, 0, 0, 0) != EMULATE_DONE) + pr_unimpl(&svm->vcpu, "%s: failed\n", __FUNCTION__); return 1; } static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data) { + struct vcpu_svm *svm = to_svm(vcpu); + switch (ecx) { case MSR_IA32_TIME_STAMP_COUNTER: { u64 tsc; rdtscll(tsc); - *data = vcpu->svm->vmcb->control.tsc_offset + tsc; + *data = svm->vmcb->control.tsc_offset + tsc; break; } case MSR_K6_STAR: - *data = vcpu->svm->vmcb->save.star; + *data = svm->vmcb->save.star; break; #ifdef CONFIG_X86_64 case MSR_LSTAR: - *data = vcpu->svm->vmcb->save.lstar; + *data = svm->vmcb->save.lstar; break; case MSR_CSTAR: - *data = vcpu->svm->vmcb->save.cstar; + *data = svm->vmcb->save.cstar; break; case MSR_KERNEL_GS_BASE: - *data = vcpu->svm->vmcb->save.kernel_gs_base; + *data = svm->vmcb->save.kernel_gs_base; break; case MSR_SYSCALL_MASK: - *data = vcpu->svm->vmcb->save.sfmask; + *data = svm->vmcb->save.sfmask; break; #endif case MSR_IA32_SYSENTER_CS: - *data = vcpu->svm->vmcb->save.sysenter_cs; + *data = svm->vmcb->save.sysenter_cs; break; case MSR_IA32_SYSENTER_EIP: - *data = vcpu->svm->vmcb->save.sysenter_eip; + *data = svm->vmcb->save.sysenter_eip; break; case MSR_IA32_SYSENTER_ESP: - *data = vcpu->svm->vmcb->save.sysenter_esp; + *data = svm->vmcb->save.sysenter_esp; break; default: return kvm_get_msr_common(vcpu, ecx, data); @@ -1195,57 +1127,59 @@ #endif return 0; } -static int rdmsr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int rdmsr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - u32 ecx = vcpu->regs[VCPU_REGS_RCX]; + u32 ecx = svm->vcpu.regs[VCPU_REGS_RCX]; u64 data; - if (svm_get_msr(vcpu, ecx, &data)) - svm_inject_gp(vcpu, 0); + if (svm_get_msr(&svm->vcpu, ecx, &data)) + svm_inject_gp(&svm->vcpu, 0); else { - vcpu->svm->vmcb->save.rax = data & 0xffffffff; - vcpu->regs[VCPU_REGS_RDX] = data >> 32; - vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 2; - skip_emulated_instruction(vcpu); + svm->vmcb->save.rax = data & 0xffffffff; + svm->vcpu.regs[VCPU_REGS_RDX] = data >> 32; + svm->next_rip = svm->vmcb->save.rip + 2; + skip_emulated_instruction(&svm->vcpu); } return 1; } static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) { + struct vcpu_svm *svm = to_svm(vcpu); + switch (ecx) { case MSR_IA32_TIME_STAMP_COUNTER: { u64 tsc; rdtscll(tsc); - vcpu->svm->vmcb->control.tsc_offset = data - tsc; + svm->vmcb->control.tsc_offset = data - tsc; break; } case MSR_K6_STAR: - vcpu->svm->vmcb->save.star = data; + svm->vmcb->save.star = data; break; #ifdef CONFIG_X86_64 case MSR_LSTAR: - vcpu->svm->vmcb->save.lstar = data; + svm->vmcb->save.lstar = data; break; case MSR_CSTAR: - vcpu->svm->vmcb->save.cstar = data; + svm->vmcb->save.cstar = data; break; case MSR_KERNEL_GS_BASE: - vcpu->svm->vmcb->save.kernel_gs_base = data; + svm->vmcb->save.kernel_gs_base = data; break; case MSR_SYSCALL_MASK: - vcpu->svm->vmcb->save.sfmask = data; + svm->vmcb->save.sfmask = data; break; #endif case MSR_IA32_SYSENTER_CS: - vcpu->svm->vmcb->save.sysenter_cs = data; + svm->vmcb->save.sysenter_cs = data; break; case 
MSR_IA32_SYSENTER_EIP: - vcpu->svm->vmcb->save.sysenter_eip = data; + svm->vmcb->save.sysenter_eip = data; break; case MSR_IA32_SYSENTER_ESP: - vcpu->svm->vmcb->save.sysenter_esp = data; + svm->vmcb->save.sysenter_esp = data; break; default: return kvm_set_msr_common(vcpu, ecx, data); @@ -1253,37 +1187,39 @@ #endif return 0; } -static int wrmsr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int wrmsr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - u32 ecx = vcpu->regs[VCPU_REGS_RCX]; - u64 data = (vcpu->svm->vmcb->save.rax & -1u) - | ((u64)(vcpu->regs[VCPU_REGS_RDX] & -1u) << 32); - vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 2; - if (svm_set_msr(vcpu, ecx, data)) - svm_inject_gp(vcpu, 0); + u32 ecx = svm->vcpu.regs[VCPU_REGS_RCX]; + u64 data = (svm->vmcb->save.rax & -1u) + | ((u64)(svm->vcpu.regs[VCPU_REGS_RDX] & -1u) << 32); + svm->next_rip = svm->vmcb->save.rip + 2; + if (svm_set_msr(&svm->vcpu, ecx, data)) + svm_inject_gp(&svm->vcpu, 0); else - skip_emulated_instruction(vcpu); + skip_emulated_instruction(&svm->vcpu); return 1; } -static int msr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int msr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { - if (vcpu->svm->vmcb->control.exit_info_1) - return wrmsr_interception(vcpu, kvm_run); + if (svm->vmcb->control.exit_info_1) + return wrmsr_interception(svm, kvm_run); else - return rdmsr_interception(vcpu, kvm_run); + return rdmsr_interception(svm, kvm_run); } -static int interrupt_window_interception(struct kvm_vcpu *vcpu, +static int interrupt_window_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { + svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VINTR); + svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; /* * If the user space waits to inject interrupts, exit as soon as * possible */ if (kvm_run->request_interrupt_window && - !vcpu->irq_summary) { - ++vcpu->stat.irq_window_exits; + !svm->vcpu.irq_summary) { + ++svm->vcpu.stat.irq_window_exits; kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN; return 0; } @@ -1291,7 +1227,7 @@ static int interrupt_window_interception return 1; } -static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu, +static int (*svm_exit_handlers[])(struct vcpu_svm *svm, struct kvm_run *kvm_run) = { [SVM_EXIT_READ_CR0] = emulate_on_interception, [SVM_EXIT_READ_CR3] = emulate_on_interception, @@ -1310,6 +1246,7 @@ static int (*svm_exit_handlers[])(struct [SVM_EXIT_WRITE_DR3] = emulate_on_interception, [SVM_EXIT_WRITE_DR5] = emulate_on_interception, [SVM_EXIT_WRITE_DR7] = emulate_on_interception, + [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception, [SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception, [SVM_EXIT_EXCP_BASE + NM_VECTOR] = nm_interception, [SVM_EXIT_INTR] = nop_on_interception, @@ -1338,15 +1275,25 @@ static int (*svm_exit_handlers[])(struct }; -static int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static int handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) { - u32 exit_code = vcpu->svm->vmcb->control.exit_code; + struct vcpu_svm *svm = to_svm(vcpu); + u32 exit_code = svm->vmcb->control.exit_code; - if (is_external_interrupt(vcpu->svm->vmcb->control.exit_int_info) && + kvm_reput_irq(svm); + + if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) { + kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; + kvm_run->fail_entry.hardware_entry_failure_reason + = svm->vmcb->control.exit_code; + return 0; + } + + if (is_external_interrupt(svm->vmcb->control.exit_int_info) && exit_code != SVM_EXIT_EXCP_BASE + 
PF_VECTOR) printk(KERN_ERR "%s: unexpected exit_ini_info 0x%x " "exit_code 0x%x\n", - __FUNCTION__, vcpu->svm->vmcb->control.exit_int_info, + __FUNCTION__, svm->vmcb->control.exit_int_info, exit_code); if (exit_code >= ARRAY_SIZE(svm_exit_handlers) @@ -1356,7 +1303,7 @@ static int handle_exit(struct kvm_vcpu * return 0; } - return svm_exit_handlers[exit_code](vcpu, kvm_run); + return svm_exit_handlers[exit_code](svm, kvm_run); } static void reload_tss(struct kvm_vcpu *vcpu) @@ -1368,93 +1315,126 @@ static void reload_tss(struct kvm_vcpu * load_TR_desc(); } -static void pre_svm_run(struct kvm_vcpu *vcpu) +static void pre_svm_run(struct vcpu_svm *svm) { int cpu = raw_smp_processor_id(); struct svm_cpu_data *svm_data = per_cpu(svm_data, cpu); - vcpu->svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; - if (vcpu->cpu != cpu || - vcpu->svm->asid_generation != svm_data->asid_generation) - new_asid(vcpu, svm_data); + svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; + if (svm->vcpu.cpu != cpu || + svm->asid_generation != svm_data->asid_generation) + new_asid(svm, svm_data); } -static inline void kvm_do_inject_irq(struct kvm_vcpu *vcpu) +static inline void svm_inject_irq(struct vcpu_svm *svm, int irq) { struct vmcb_control_area *control; - control = &vcpu->svm->vmcb->control; - control->int_vector = pop_irq(vcpu); + control = &svm->vmcb->control; + control->int_vector = irq; control->int_ctl &= ~V_INTR_PRIO_MASK; control->int_ctl |= V_IRQ_MASK | ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT); } -static void kvm_reput_irq(struct kvm_vcpu *vcpu) +static void svm_set_irq(struct kvm_vcpu *vcpu, int irq) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm_inject_irq(svm, irq); +} + +static void svm_intr_assist(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb *vmcb = svm->vmcb; + int intr_vector = -1; + + kvm_inject_pending_timer_irqs(vcpu); + if ((vmcb->control.exit_int_info & SVM_EVTINJ_VALID) && + ((vmcb->control.exit_int_info & SVM_EVTINJ_TYPE_MASK) == 0)) { + intr_vector = vmcb->control.exit_int_info & + SVM_EVTINJ_VEC_MASK; + vmcb->control.exit_int_info = 0; + svm_inject_irq(svm, intr_vector); + return; + } + + if (vmcb->control.int_ctl & V_IRQ_MASK) + return; + + if (!kvm_cpu_has_interrupt(vcpu)) + return; + + if (!(vmcb->save.rflags & X86_EFLAGS_IF) || + (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) || + (vmcb->control.event_inj & SVM_EVTINJ_VALID)) { + /* unable to deliver irq, set pending irq */ + vmcb->control.intercept |= (1ULL << INTERCEPT_VINTR); + svm_inject_irq(svm, 0x0); + return; + } + /* Okay, we can deliver the interrupt: grab it and update PIC state. 
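 * (Delivery requires all three gates just tested: RFLAGS.IF set, no
 * sti/mov-ss interrupt shadow, and no event already sitting in
 * event_inj; when any gate is closed, the code above instead queues a
 * dummy virtual interrupt (vector 0) and sets the VINTR intercept, so
 * it regains control the instant the window opens.)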
*/ + intr_vector = kvm_cpu_get_interrupt(vcpu); + svm_inject_irq(svm, intr_vector); + kvm_timer_intr_post(vcpu, intr_vector); +} + +static void kvm_reput_irq(struct vcpu_svm *svm) { - struct vmcb_control_area *control = &vcpu->svm->vmcb->control; + struct vmcb_control_area *control = &svm->vmcb->control; - if (control->int_ctl & V_IRQ_MASK) { + if ((control->int_ctl & V_IRQ_MASK) + && !irqchip_in_kernel(svm->vcpu.kvm)) { control->int_ctl &= ~V_IRQ_MASK; - push_irq(vcpu, control->int_vector); + push_irq(&svm->vcpu, control->int_vector); } - vcpu->interrupt_window_open = + svm->vcpu.interrupt_window_open = !(control->int_state & SVM_INTERRUPT_SHADOW_MASK); } +static void svm_do_inject_vector(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + int word_index = __ffs(vcpu->irq_summary); + int bit_index = __ffs(vcpu->irq_pending[word_index]); + int irq = word_index * BITS_PER_LONG + bit_index; + + clear_bit(bit_index, &vcpu->irq_pending[word_index]); + if (!vcpu->irq_pending[word_index]) + clear_bit(word_index, &vcpu->irq_summary); + svm_inject_irq(svm, irq); +} + static void do_interrupt_requests(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { - struct vmcb_control_area *control = &vcpu->svm->vmcb->control; + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_control_area *control = &svm->vmcb->control; - vcpu->interrupt_window_open = + svm->vcpu.interrupt_window_open = (!(control->int_state & SVM_INTERRUPT_SHADOW_MASK) && - (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF)); + (svm->vmcb->save.rflags & X86_EFLAGS_IF)); - if (vcpu->interrupt_window_open && vcpu->irq_summary) + if (svm->vcpu.interrupt_window_open && svm->vcpu.irq_summary) /* * If interrupts enabled, and not blocked by sti or mov ss. Good. */ - kvm_do_inject_irq(vcpu); + svm_do_inject_vector(svm); /* * Interrupts blocked. Wait for unblock. */ - if (!vcpu->interrupt_window_open && - (vcpu->irq_summary || kvm_run->request_interrupt_window)) { + if (!svm->vcpu.interrupt_window_open && + (svm->vcpu.irq_summary || kvm_run->request_interrupt_window)) { control->intercept |= 1ULL << INTERCEPT_VINTR; } else control->intercept &= ~(1ULL << INTERCEPT_VINTR); } -static void post_kvm_run_save(struct kvm_vcpu *vcpu, - struct kvm_run *kvm_run) -{ - kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open && - vcpu->irq_summary == 0); - kvm_run->if_flag = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF) != 0; - kvm_run->cr8 = vcpu->cr8; - kvm_run->apic_base = vcpu->apic_base; -} - -/* - * Check if userspace requested an interrupt window, and that the - * interrupt window is open. - * - * No need to exit to userspace if we already have an interrupt queued. 
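 * (Condensed, that predicate -- spelled out in the
 * dm_request_for_irq_injection() deleted just below -- is:
 *
 *     return !vcpu->irq_summary &&
 *            kvm_run->request_interrupt_window &&
 *            vcpu->interrupt_window_open &&
 *            (rflags & X86_EFLAGS_IF);
 *
 * It leaves svm.c together with post_kvm_run_save() and the again:
 * loop in svm_vcpu_run() further down, as the run loop is evidently
 * hoisted into common code and handle_exit becomes a kvm_x86_ops
 * callback.)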
- */ -static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu, - struct kvm_run *kvm_run) -{ - return (!vcpu->irq_summary && - kvm_run->request_interrupt_window && - vcpu->interrupt_window_open && - (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF)); -} - static void save_db_regs(unsigned long *db_regs) { asm volatile ("mov %%dr0, %0" : "=r"(db_regs[0])); @@ -1476,49 +1456,37 @@ static void svm_flush_tlb(struct kvm_vcp force_new_asid(vcpu); } -static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu) { +} + +static void svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + struct vcpu_svm *svm = to_svm(vcpu); u16 fs_selector; u16 gs_selector; u16 ldt_selector; - int r; - -again: - r = kvm_mmu_reload(vcpu); - if (unlikely(r)) - return r; - if (!vcpu->mmio_read_completed) - do_interrupt_requests(vcpu, kvm_run); - - clgi(); - - vcpu->guest_mode = 1; - if (vcpu->requests) - if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) - svm_flush_tlb(vcpu); - - pre_svm_run(vcpu); + pre_svm_run(svm); save_host_msrs(vcpu); fs_selector = read_fs(); gs_selector = read_gs(); ldt_selector = read_ldt(); - vcpu->svm->host_cr2 = kvm_read_cr2(); - vcpu->svm->host_dr6 = read_dr6(); - vcpu->svm->host_dr7 = read_dr7(); - vcpu->svm->vmcb->save.cr2 = vcpu->cr2; + svm->host_cr2 = kvm_read_cr2(); + svm->host_dr6 = read_dr6(); + svm->host_dr7 = read_dr7(); + svm->vmcb->save.cr2 = vcpu->cr2; - if (vcpu->svm->vmcb->save.dr7 & 0xff) { + if (svm->vmcb->save.dr7 & 0xff) { write_dr7(0); - save_db_regs(vcpu->svm->host_db_regs); - load_db_regs(vcpu->svm->db_regs); + save_db_regs(svm->host_db_regs); + load_db_regs(svm->db_regs); } - if (vcpu->fpu_active) { - fx_save(vcpu->host_fx_image); - fx_restore(vcpu->guest_fx_image); - } + clgi(); + + local_irq_enable(); asm volatile ( #ifdef CONFIG_X86_64 @@ -1532,34 +1500,33 @@ #else #endif #ifdef CONFIG_X86_64 - "mov %c[rbx](%[vcpu]), %%rbx \n\t" - "mov %c[rcx](%[vcpu]), %%rcx \n\t" - "mov %c[rdx](%[vcpu]), %%rdx \n\t" - "mov %c[rsi](%[vcpu]), %%rsi \n\t" - "mov %c[rdi](%[vcpu]), %%rdi \n\t" - "mov %c[rbp](%[vcpu]), %%rbp \n\t" - "mov %c[r8](%[vcpu]), %%r8 \n\t" - "mov %c[r9](%[vcpu]), %%r9 \n\t" - "mov %c[r10](%[vcpu]), %%r10 \n\t" - "mov %c[r11](%[vcpu]), %%r11 \n\t" - "mov %c[r12](%[vcpu]), %%r12 \n\t" - "mov %c[r13](%[vcpu]), %%r13 \n\t" - "mov %c[r14](%[vcpu]), %%r14 \n\t" - "mov %c[r15](%[vcpu]), %%r15 \n\t" + "mov %c[rbx](%[svm]), %%rbx \n\t" + "mov %c[rcx](%[svm]), %%rcx \n\t" + "mov %c[rdx](%[svm]), %%rdx \n\t" + "mov %c[rsi](%[svm]), %%rsi \n\t" + "mov %c[rdi](%[svm]), %%rdi \n\t" + "mov %c[rbp](%[svm]), %%rbp \n\t" + "mov %c[r8](%[svm]), %%r8 \n\t" + "mov %c[r9](%[svm]), %%r9 \n\t" + "mov %c[r10](%[svm]), %%r10 \n\t" + "mov %c[r11](%[svm]), %%r11 \n\t" + "mov %c[r12](%[svm]), %%r12 \n\t" + "mov %c[r13](%[svm]), %%r13 \n\t" + "mov %c[r14](%[svm]), %%r14 \n\t" + "mov %c[r15](%[svm]), %%r15 \n\t" #else - "mov %c[rbx](%[vcpu]), %%ebx \n\t" - "mov %c[rcx](%[vcpu]), %%ecx \n\t" - "mov %c[rdx](%[vcpu]), %%edx \n\t" - "mov %c[rsi](%[vcpu]), %%esi \n\t" - "mov %c[rdi](%[vcpu]), %%edi \n\t" - "mov %c[rbp](%[vcpu]), %%ebp \n\t" + "mov %c[rbx](%[svm]), %%ebx \n\t" + "mov %c[rcx](%[svm]), %%ecx \n\t" + "mov %c[rdx](%[svm]), %%edx \n\t" + "mov %c[rsi](%[svm]), %%esi \n\t" + "mov %c[rdi](%[svm]), %%edi \n\t" + "mov %c[rbp](%[svm]), %%ebp \n\t" #endif #ifdef CONFIG_X86_64 /* Enter guest mode */ "push %%rax \n\t" - "mov %c[svm](%[vcpu]), %%rax \n\t" - "mov %c[vmcb](%%rax), %%rax \n\t" + "mov 
%c[vmcb](%[svm]), %%rax \n\t" SVM_VMLOAD "\n\t" SVM_VMRUN "\n\t" SVM_VMSAVE "\n\t" @@ -1567,8 +1534,7 @@ #ifdef CONFIG_X86_64 #else /* Enter guest mode */ "push %%eax \n\t" - "mov %c[svm](%[vcpu]), %%eax \n\t" - "mov %c[vmcb](%%eax), %%eax \n\t" + "mov %c[vmcb](%[svm]), %%eax \n\t" SVM_VMLOAD "\n\t" SVM_VMRUN "\n\t" SVM_VMSAVE "\n\t" @@ -1577,73 +1543,69 @@ #endif /* Save guest registers, load host registers */ #ifdef CONFIG_X86_64 - "mov %%rbx, %c[rbx](%[vcpu]) \n\t" - "mov %%rcx, %c[rcx](%[vcpu]) \n\t" - "mov %%rdx, %c[rdx](%[vcpu]) \n\t" - "mov %%rsi, %c[rsi](%[vcpu]) \n\t" - "mov %%rdi, %c[rdi](%[vcpu]) \n\t" - "mov %%rbp, %c[rbp](%[vcpu]) \n\t" - "mov %%r8, %c[r8](%[vcpu]) \n\t" - "mov %%r9, %c[r9](%[vcpu]) \n\t" - "mov %%r10, %c[r10](%[vcpu]) \n\t" - "mov %%r11, %c[r11](%[vcpu]) \n\t" - "mov %%r12, %c[r12](%[vcpu]) \n\t" - "mov %%r13, %c[r13](%[vcpu]) \n\t" - "mov %%r14, %c[r14](%[vcpu]) \n\t" - "mov %%r15, %c[r15](%[vcpu]) \n\t" + "mov %%rbx, %c[rbx](%[svm]) \n\t" + "mov %%rcx, %c[rcx](%[svm]) \n\t" + "mov %%rdx, %c[rdx](%[svm]) \n\t" + "mov %%rsi, %c[rsi](%[svm]) \n\t" + "mov %%rdi, %c[rdi](%[svm]) \n\t" + "mov %%rbp, %c[rbp](%[svm]) \n\t" + "mov %%r8, %c[r8](%[svm]) \n\t" + "mov %%r9, %c[r9](%[svm]) \n\t" + "mov %%r10, %c[r10](%[svm]) \n\t" + "mov %%r11, %c[r11](%[svm]) \n\t" + "mov %%r12, %c[r12](%[svm]) \n\t" + "mov %%r13, %c[r13](%[svm]) \n\t" + "mov %%r14, %c[r14](%[svm]) \n\t" + "mov %%r15, %c[r15](%[svm]) \n\t" "pop %%r15; pop %%r14; pop %%r13; pop %%r12;" "pop %%r11; pop %%r10; pop %%r9; pop %%r8;" "pop %%rbp; pop %%rdi; pop %%rsi;" "pop %%rdx; pop %%rcx; pop %%rbx; \n\t" #else - "mov %%ebx, %c[rbx](%[vcpu]) \n\t" - "mov %%ecx, %c[rcx](%[vcpu]) \n\t" - "mov %%edx, %c[rdx](%[vcpu]) \n\t" - "mov %%esi, %c[rsi](%[vcpu]) \n\t" - "mov %%edi, %c[rdi](%[vcpu]) \n\t" - "mov %%ebp, %c[rbp](%[vcpu]) \n\t" + "mov %%ebx, %c[rbx](%[svm]) \n\t" + "mov %%ecx, %c[rcx](%[svm]) \n\t" + "mov %%edx, %c[rdx](%[svm]) \n\t" + "mov %%esi, %c[rsi](%[svm]) \n\t" + "mov %%edi, %c[rdi](%[svm]) \n\t" + "mov %%ebp, %c[rbp](%[svm]) \n\t" "pop %%ebp; pop %%edi; pop %%esi;" "pop %%edx; pop %%ecx; pop %%ebx; \n\t" #endif : - : [vcpu]"a"(vcpu), - [svm]"i"(offsetof(struct kvm_vcpu, svm)), + : [svm]"a"(svm), [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)), - [rbx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBX])), - [rcx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RCX])), - [rdx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDX])), - [rsi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RSI])), - [rdi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDI])), - [rbp]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBP])) + [rbx]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_RBX])), + [rcx]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_RCX])), + [rdx]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_RDX])), + [rsi]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_RSI])), + [rdi]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_RDI])), + [rbp]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_RBP])) #ifdef CONFIG_X86_64 - ,[r8 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R8 ])), - [r9 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R9 ])), - [r10]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R10])), - [r11]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R11])), - [r12]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R12])), - [r13]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R13])), - [r14]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R14])), - [r15]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R15])) + ,[r8 ]"i"(offsetof(struct 
vcpu_svm,vcpu.regs[VCPU_REGS_R8])), + [r9 ]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R9 ])), + [r10]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R10])), + [r11]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R11])), + [r12]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R12])), + [r13]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R13])), + [r14]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R14])), + [r15]"i"(offsetof(struct vcpu_svm,vcpu.regs[VCPU_REGS_R15])) #endif : "cc", "memory" ); - vcpu->guest_mode = 0; + local_irq_disable(); - if (vcpu->fpu_active) { - fx_save(vcpu->guest_fx_image); - fx_restore(vcpu->host_fx_image); - } + stgi(); - if ((vcpu->svm->vmcb->save.dr7 & 0xff)) - load_db_regs(vcpu->svm->host_db_regs); + if ((svm->vmcb->save.dr7 & 0xff)) + load_db_regs(svm->host_db_regs); - vcpu->cr2 = vcpu->svm->vmcb->save.cr2; + vcpu->cr2 = svm->vmcb->save.cr2; - write_dr6(vcpu->svm->host_dr6); - write_dr7(vcpu->svm->host_dr7); - kvm_write_cr2(vcpu->svm->host_cr2); + write_dr6(svm->host_dr6); + write_dr7(svm->host_dr7); + kvm_write_cr2(svm->host_cr2); load_fs(fs_selector); load_gs(gs_selector); @@ -1652,57 +1614,19 @@ #endif reload_tss(vcpu); - /* - * Profile KVM exit RIPs: - */ - if (unlikely(prof_on == KVM_PROFILING)) - profile_hit(KVM_PROFILING, - (void *)(unsigned long)vcpu->svm->vmcb->save.rip); - - stgi(); - - kvm_reput_irq(vcpu); - - vcpu->svm->next_rip = 0; - - if (vcpu->svm->vmcb->control.exit_code == SVM_EXIT_ERR) { - kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; - kvm_run->fail_entry.hardware_entry_failure_reason - = vcpu->svm->vmcb->control.exit_code; - post_kvm_run_save(vcpu, kvm_run); - return 0; - } - - r = handle_exit(vcpu, kvm_run); - if (r > 0) { - if (signal_pending(current)) { - ++vcpu->stat.signal_exits; - post_kvm_run_save(vcpu, kvm_run); - kvm_run->exit_reason = KVM_EXIT_INTR; - return -EINTR; - } - - if (dm_request_for_irq_injection(vcpu, kvm_run)) { - ++vcpu->stat.request_irq_exits; - post_kvm_run_save(vcpu, kvm_run); - kvm_run->exit_reason = KVM_EXIT_INTR; - return -EINTR; - } - kvm_resched(vcpu); - goto again; - } - post_kvm_run_save(vcpu, kvm_run); - return r; + svm->next_rip = 0; } static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root) { - vcpu->svm->vmcb->save.cr3 = root; + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.cr3 = root; force_new_asid(vcpu); if (vcpu->fpu_active) { - vcpu->svm->vmcb->control.intercept_exceptions |= (1 << NM_VECTOR); - vcpu->svm->vmcb->save.cr0 |= CR0_TS_MASK; + svm->vmcb->control.intercept_exceptions |= (1 << NM_VECTOR); + svm->vmcb->save.cr0 |= X86_CR0_TS; vcpu->fpu_active = 0; } } @@ -1711,26 +1635,27 @@ static void svm_inject_page_fault(struct unsigned long addr, uint32_t err_code) { - uint32_t exit_int_info = vcpu->svm->vmcb->control.exit_int_info; + struct vcpu_svm *svm = to_svm(vcpu); + uint32_t exit_int_info = svm->vmcb->control.exit_int_info; ++vcpu->stat.pf_guest; if (is_page_fault(exit_int_info)) { - vcpu->svm->vmcb->control.event_inj_err = 0; - vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | - SVM_EVTINJ_VALID_ERR | - SVM_EVTINJ_TYPE_EXEPT | - DF_VECTOR; + svm->vmcb->control.event_inj_err = 0; + svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_VALID_ERR | + SVM_EVTINJ_TYPE_EXEPT | + DF_VECTOR; return; } vcpu->cr2 = addr; - vcpu->svm->vmcb->save.cr2 = addr; - vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | - SVM_EVTINJ_VALID_ERR | - SVM_EVTINJ_TYPE_EXEPT | - PF_VECTOR; - vcpu->svm->vmcb->control.event_inj_err = err_code; + svm->vmcb->save.cr2 = addr; + 
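/*
 * (On the vmrun asm block above: guest registers are now addressed
 * relative to the single %[svm] base operand, with each offset baked in
 * as an immediate through an "i" constraint and gcc's %c modifier,
 * which prints the constant without the AT&T '$' prefix.  The same
 * trick in miniature, assuming x86-64 gcc:
 *
 *   struct regs { unsigned long rbx, rcx; } r = { 1, 2 };
 *   unsigned long val;
 *
 *   asm("mov %c[off](%[base]), %[out]"
 *       : [out] "=r" (val)
 *       : [base] "r" (&r), [off] "i" (offsetof(struct regs, rcx))
 *       : "memory");
 *
 * val ends up as 2, the offset having been folded at compile time, and
 * dropping the extra indirection is what lets the old two-step
 * "mov %c[svm](%[vcpu]) / mov %c[vmcb](%%rax)" become a single load.)
 */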
svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_VALID_ERR | + SVM_EVTINJ_TYPE_EXEPT | + PF_VECTOR; + svm->vmcb->control.event_inj_err = err_code; } @@ -1754,20 +1679,27 @@ svm_patch_hypercall(struct kvm_vcpu *vcp hypercall[0] = 0x0f; hypercall[1] = 0x01; hypercall[2] = 0xd9; - hypercall[3] = 0xc3; } -static struct kvm_arch_ops svm_arch_ops = { +static void svm_check_processor_compat(void *rtn) +{ + *(int *)rtn = 0; +} + +static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, .hardware_setup = svm_hardware_setup, .hardware_unsetup = svm_hardware_unsetup, + .check_processor_compatibility = svm_check_processor_compat, .hardware_enable = svm_hardware_enable, .hardware_disable = svm_hardware_disable, .vcpu_create = svm_create_vcpu, .vcpu_free = svm_free_vcpu, + .vcpu_reset = svm_vcpu_reset, + .prepare_guest_switch = svm_prepare_guest_switch, .vcpu_load = svm_vcpu_load, .vcpu_put = svm_vcpu_put, .vcpu_decache = svm_vcpu_decache, @@ -1778,7 +1710,7 @@ static struct kvm_arch_ops svm_arch_ops .get_segment_base = svm_get_segment_base, .get_segment = svm_get_segment, .set_segment = svm_set_segment, - .get_cs_db_l_bits = svm_get_cs_db_l_bits, + .get_cs_db_l_bits = kvm_get_cs_db_l_bits, .decache_cr4_guest_bits = svm_decache_cr4_guest_bits, .set_cr0 = svm_set_cr0, .set_cr3 = svm_set_cr3, @@ -1795,26 +1727,30 @@ static struct kvm_arch_ops svm_arch_ops .get_rflags = svm_get_rflags, .set_rflags = svm_set_rflags, - .invlpg = svm_invlpg, .tlb_flush = svm_flush_tlb, .inject_page_fault = svm_inject_page_fault, .inject_gp = svm_inject_gp, .run = svm_vcpu_run, + .handle_exit = handle_exit, .skip_emulated_instruction = skip_emulated_instruction, - .vcpu_setup = svm_vcpu_setup, .patch_hypercall = svm_patch_hypercall, + .get_irq = svm_get_irq, + .set_irq = svm_set_irq, + .inject_pending_irq = svm_intr_assist, + .inject_pending_vectors = do_interrupt_requests, }; static int __init svm_init(void) { - return kvm_init_arch(&svm_arch_ops, THIS_MODULE); + return kvm_init_x86(&svm_x86_ops, sizeof(struct vcpu_svm), + THIS_MODULE); } static void __exit svm_exit(void) { - kvm_exit_arch(); + kvm_exit_x86(); } module_init(svm_init) diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c index 80628f6..6f1ad90 100644 --- a/drivers/kvm/vmx.c +++ b/drivers/kvm/vmx.c @@ -16,6 +16,8 @@ */ #include "kvm.h" +#include "x86_emulate.h" +#include "irq.h" #include "vmx.h" #include "segment_descriptor.h" @@ -23,8 +25,8 @@ #include #include #include #include -#include #include +#include #include #include @@ -32,6 +34,43 @@ #include MODULE_AUTHOR("Qumranet"); MODULE_LICENSE("GPL"); +static int bypass_guest_pf = 1; +module_param(bypass_guest_pf, bool, 0); + +struct vmcs { + u32 revision_id; + u32 abort; + char data[0]; +}; + +struct vcpu_vmx { + struct kvm_vcpu vcpu; + int launched; + u8 fail; + struct kvm_msr_entry *guest_msrs; + struct kvm_msr_entry *host_msrs; + int nmsrs; + int save_nmsrs; + int msr_offset_efer; +#ifdef CONFIG_X86_64 + int msr_offset_kernel_gs_base; +#endif + struct vmcs *vmcs; + struct { + int loaded; + u16 fs_sel, gs_sel, ldt_sel; + int gs_ldt_reload_needed; + int fs_reload_needed; + int guest_efer_loaded; + }host_state; + +}; + +static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) +{ + return container_of(vcpu, struct vcpu_vmx, vcpu); +} + static int init_rmode_tss(struct kvm *kvm); static DEFINE_PER_CPU(struct vmcs *, vmxarea); @@ -40,18 +79,15 @@ static DEFINE_PER_CPU(struct vmcs *, cur static struct page *vmx_io_bitmap_a; static struct page 
*vmx_io_bitmap_b; -#ifdef CONFIG_X86_64 -#define HOST_IS_64 1 -#else -#define HOST_IS_64 0 -#endif -#define EFER_SAVE_RESTORE_BITS ((u64)EFER_SCE) - -static struct vmcs_descriptor { +static struct vmcs_config { int size; int order; u32 revision_id; -} vmcs_descriptor; + u32 pin_based_exec_ctrl; + u32 cpu_based_exec_ctrl; + u32 vmexit_ctrl; + u32 vmentry_ctrl; +} vmcs_config; #define VMX_SEGMENT_FIELD(seg) \ [VCPU_SREG_##seg] = { \ @@ -89,16 +125,20 @@ #endif }; #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index) -static inline u64 msr_efer_save_restore_bits(struct vmx_msr_entry msr) +static void load_msrs(struct kvm_msr_entry *e, int n) { - return (u64)msr.data & EFER_SAVE_RESTORE_BITS; + int i; + + for (i = 0; i < n; ++i) + wrmsrl(e[i].index, e[i].data); } -static inline int msr_efer_need_save_restore(struct kvm_vcpu *vcpu) +static void save_msrs(struct kvm_msr_entry *e, int n) { - int efer_offset = vcpu->msr_offset_efer; - return msr_efer_save_restore_bits(vcpu->host_msrs[efer_offset]) != - msr_efer_save_restore_bits(vcpu->guest_msrs[efer_offset]); + int i; + + for (i = 0; i < n; ++i) + rdmsrl(e[i].index, e[i].data); } static inline int is_page_fault(u32 intr_info) @@ -115,29 +155,46 @@ static inline int is_no_device(u32 intr_ (INTR_TYPE_EXCEPTION | NM_VECTOR | INTR_INFO_VALID_MASK); } +static inline int is_invalid_opcode(u32 intr_info) +{ + return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | + INTR_INFO_VALID_MASK)) == + (INTR_TYPE_EXCEPTION | UD_VECTOR | INTR_INFO_VALID_MASK); +} + static inline int is_external_interrupt(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) == (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); } -static int __find_msr_index(struct kvm_vcpu *vcpu, u32 msr) +static inline int cpu_has_vmx_tpr_shadow(void) +{ + return (vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW); +} + +static inline int vm_need_tpr_shadow(struct kvm *kvm) +{ + return ((cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm))); +} + +static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr) { int i; - for (i = 0; i < vcpu->nmsrs; ++i) - if (vcpu->guest_msrs[i].index == msr) + for (i = 0; i < vmx->nmsrs; ++i) + if (vmx->guest_msrs[i].index == msr) return i; return -1; } -static struct vmx_msr_entry *find_msr_entry(struct kvm_vcpu *vcpu, u32 msr) +static struct kvm_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr) { int i; - i = __find_msr_index(vcpu, msr); + i = __find_msr_index(vmx, msr); if (i >= 0) - return &vcpu->guest_msrs[i]; + return &vmx->guest_msrs[i]; return NULL; } @@ -156,23 +213,24 @@ static void vmcs_clear(struct vmcs *vmcs static void __vcpu_clear(void *arg) { - struct kvm_vcpu *vcpu = arg; + struct vcpu_vmx *vmx = arg; int cpu = raw_smp_processor_id(); - if (vcpu->cpu == cpu) - vmcs_clear(vcpu->vmcs); - if (per_cpu(current_vmcs, cpu) == vcpu->vmcs) + if (vmx->vcpu.cpu == cpu) + vmcs_clear(vmx->vmcs); + if (per_cpu(current_vmcs, cpu) == vmx->vmcs) per_cpu(current_vmcs, cpu) = NULL; - rdtscll(vcpu->host_tsc); + rdtscll(vmx->vcpu.host_tsc); } -static void vcpu_clear(struct kvm_vcpu *vcpu) +static void vcpu_clear(struct vcpu_vmx *vmx) { - if (vcpu->cpu != raw_smp_processor_id() && vcpu->cpu != -1) - smp_call_function_single(vcpu->cpu, __vcpu_clear, vcpu, 0, 1); + if (vmx->vcpu.cpu != raw_smp_processor_id() && vmx->vcpu.cpu != -1) + smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, + vmx, 0, 1); else - __vcpu_clear(vcpu); - vcpu->launched = 0; + __vcpu_clear(vmx); + vmx->launched = 0; } static unsigned long 
vmcs_readl(unsigned long field) @@ -255,7 +313,7 @@ static void update_exception_bitmap(stru { u32 eb; - eb = 1u << PF_VECTOR; + eb = (1u << PF_VECTOR) | (1u << UD_VECTOR); if (!vcpu->fpu_active) eb |= 1u << NM_VECTOR; if (vcpu->guest_debug.enabled) @@ -282,121 +340,144 @@ #ifndef CONFIG_X86_64 #endif } -static void load_transition_efer(struct kvm_vcpu *vcpu) +static void load_transition_efer(struct vcpu_vmx *vmx) { - u64 trans_efer; - int efer_offset = vcpu->msr_offset_efer; + int efer_offset = vmx->msr_offset_efer; + u64 host_efer = vmx->host_msrs[efer_offset].data; + u64 guest_efer = vmx->guest_msrs[efer_offset].data; + u64 ignore_bits; - trans_efer = vcpu->host_msrs[efer_offset].data; - trans_efer &= ~EFER_SAVE_RESTORE_BITS; - trans_efer |= msr_efer_save_restore_bits( - vcpu->guest_msrs[efer_offset]); - wrmsrl(MSR_EFER, trans_efer); - vcpu->stat.efer_reload++; + /* + * NX is emulated; LMA and LME handled by hardware; SCE meaninless + * outside long mode + */ + ignore_bits = EFER_NX | EFER_SCE; +#ifdef CONFIG_X86_64 + ignore_bits |= EFER_LMA | EFER_LME; + /* SCE is meaningful only in long mode on Intel */ + if (guest_efer & EFER_LMA) + ignore_bits &= ~(u64)EFER_SCE; +#endif + if ((guest_efer & ~ignore_bits) == (host_efer & ~ignore_bits)) + return; + + vmx->host_state.guest_efer_loaded = 1; + guest_efer &= ~ignore_bits; + guest_efer |= host_efer & ignore_bits; + wrmsrl(MSR_EFER, guest_efer); + vmx->vcpu.stat.efer_reload++; +} + +static void reload_host_efer(struct vcpu_vmx *vmx) +{ + if (vmx->host_state.guest_efer_loaded) { + vmx->host_state.guest_efer_loaded = 0; + load_msrs(vmx->host_msrs + vmx->msr_offset_efer, 1); + } } static void vmx_save_host_state(struct kvm_vcpu *vcpu) { - struct vmx_host_state *hs = &vcpu->vmx_host_state; + struct vcpu_vmx *vmx = to_vmx(vcpu); - if (hs->loaded) + if (vmx->host_state.loaded) return; - hs->loaded = 1; + vmx->host_state.loaded = 1; /* * Set host fs and gs selectors. Unfortunately, 22.2.3 does not * allow segment selectors with cpl > 0 or ti == 1. 
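 * (Hence the bookkeeping below: a host selector with the TI bit or a
 * nonzero RPL cannot be stashed in HOST_FS_SELECTOR/HOST_GS_SELECTOR,
 * so the VMCS field is written as 0 and fs_reload_needed /
 * gs_ldt_reload_needed arrange a manual reload in
 * vmx_load_host_state() instead.)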
*/ - hs->ldt_sel = read_ldt(); - hs->fs_gs_ldt_reload_needed = hs->ldt_sel; - hs->fs_sel = read_fs(); - if (!(hs->fs_sel & 7)) - vmcs_write16(HOST_FS_SELECTOR, hs->fs_sel); - else { + vmx->host_state.ldt_sel = read_ldt(); + vmx->host_state.gs_ldt_reload_needed = vmx->host_state.ldt_sel; + vmx->host_state.fs_sel = read_fs(); + if (!(vmx->host_state.fs_sel & 7)) { + vmcs_write16(HOST_FS_SELECTOR, vmx->host_state.fs_sel); + vmx->host_state.fs_reload_needed = 0; + } else { vmcs_write16(HOST_FS_SELECTOR, 0); - hs->fs_gs_ldt_reload_needed = 1; + vmx->host_state.fs_reload_needed = 1; } - hs->gs_sel = read_gs(); - if (!(hs->gs_sel & 7)) - vmcs_write16(HOST_GS_SELECTOR, hs->gs_sel); + vmx->host_state.gs_sel = read_gs(); + if (!(vmx->host_state.gs_sel & 7)) + vmcs_write16(HOST_GS_SELECTOR, vmx->host_state.gs_sel); else { vmcs_write16(HOST_GS_SELECTOR, 0); - hs->fs_gs_ldt_reload_needed = 1; + vmx->host_state.gs_ldt_reload_needed = 1; } #ifdef CONFIG_X86_64 vmcs_writel(HOST_FS_BASE, read_msr(MSR_FS_BASE)); vmcs_writel(HOST_GS_BASE, read_msr(MSR_GS_BASE)); #else - vmcs_writel(HOST_FS_BASE, segment_base(hs->fs_sel)); - vmcs_writel(HOST_GS_BASE, segment_base(hs->gs_sel)); + vmcs_writel(HOST_FS_BASE, segment_base(vmx->host_state.fs_sel)); + vmcs_writel(HOST_GS_BASE, segment_base(vmx->host_state.gs_sel)); #endif #ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - save_msrs(vcpu->host_msrs + vcpu->msr_offset_kernel_gs_base, 1); + if (is_long_mode(&vmx->vcpu)) { + save_msrs(vmx->host_msrs + + vmx->msr_offset_kernel_gs_base, 1); } #endif - load_msrs(vcpu->guest_msrs, vcpu->save_nmsrs); - if (msr_efer_need_save_restore(vcpu)) - load_transition_efer(vcpu); + load_msrs(vmx->guest_msrs, vmx->save_nmsrs); + load_transition_efer(vmx); } -static void vmx_load_host_state(struct kvm_vcpu *vcpu) +static void vmx_load_host_state(struct vcpu_vmx *vmx) { - struct vmx_host_state *hs = &vcpu->vmx_host_state; + unsigned long flags; - if (!hs->loaded) + if (!vmx->host_state.loaded) return; - hs->loaded = 0; - if (hs->fs_gs_ldt_reload_needed) { - load_ldt(hs->ldt_sel); - load_fs(hs->fs_sel); + vmx->host_state.loaded = 0; + if (vmx->host_state.fs_reload_needed) + load_fs(vmx->host_state.fs_sel); + if (vmx->host_state.gs_ldt_reload_needed) { + load_ldt(vmx->host_state.ldt_sel); /* * If we have to reload gs, we must take care to * preserve our gs base. */ - local_irq_disable(); - load_gs(hs->gs_sel); + local_irq_save(flags); + load_gs(vmx->host_state.gs_sel); #ifdef CONFIG_X86_64 wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE)); #endif - local_irq_enable(); - - reload_tss(); + local_irq_restore(flags); } - save_msrs(vcpu->guest_msrs, vcpu->save_nmsrs); - load_msrs(vcpu->host_msrs, vcpu->save_nmsrs); - if (msr_efer_need_save_restore(vcpu)) - load_msrs(vcpu->host_msrs + vcpu->msr_offset_efer, 1); + reload_tss(); + save_msrs(vmx->guest_msrs, vmx->save_nmsrs); + load_msrs(vmx->host_msrs, vmx->save_nmsrs); + reload_host_efer(vmx); } /* * Switches to specified vcpu, until a matching vcpu_put(), but assumes * vcpu mutex is already taken. 
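 * (Context for the new cpu argument: a VMCS is only usable on the cpu
 * where it was last VMPTRLDed, so a migrated vcpu is first
 * vcpu_clear()ed -- a VMCLEAR shipped to the old cpu via
 * smp_call_function_single() -- before the VMPTRLD below binds it to
 * the new cpu.)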
*/ -static void vmx_vcpu_load(struct kvm_vcpu *vcpu) +static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { - u64 phys_addr = __pa(vcpu->vmcs); - int cpu; + struct vcpu_vmx *vmx = to_vmx(vcpu); + u64 phys_addr = __pa(vmx->vmcs); u64 tsc_this, delta; - cpu = get_cpu(); - - if (vcpu->cpu != cpu) - vcpu_clear(vcpu); + if (vcpu->cpu != cpu) { + vcpu_clear(vmx); + kvm_migrate_apic_timer(vcpu); + } - if (per_cpu(current_vmcs, cpu) != vcpu->vmcs) { + if (per_cpu(current_vmcs, cpu) != vmx->vmcs) { u8 error; - per_cpu(current_vmcs, cpu) = vcpu->vmcs; + per_cpu(current_vmcs, cpu) = vmx->vmcs; asm volatile (ASM_VMX_VMPTRLD_RAX "; setna %0" : "=g"(error) : "a"(&phys_addr), "m"(phys_addr) : "cc"); if (error) printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n", - vcpu->vmcs, phys_addr); + vmx->vmcs, phys_addr); } if (vcpu->cpu != cpu) { @@ -426,9 +507,8 @@ static void vmx_vcpu_load(struct kvm_vcp static void vmx_vcpu_put(struct kvm_vcpu *vcpu) { - vmx_load_host_state(vcpu); + vmx_load_host_state(to_vmx(vcpu)); kvm_put_guest_fpu(vcpu); - put_cpu(); } static void vmx_fpu_activate(struct kvm_vcpu *vcpu) @@ -436,9 +516,9 @@ static void vmx_fpu_activate(struct kvm_ if (vcpu->fpu_active) return; vcpu->fpu_active = 1; - vmcs_clear_bits(GUEST_CR0, CR0_TS_MASK); - if (vcpu->cr0 & CR0_TS_MASK) - vmcs_set_bits(GUEST_CR0, CR0_TS_MASK); + vmcs_clear_bits(GUEST_CR0, X86_CR0_TS); + if (vcpu->cr0 & X86_CR0_TS) + vmcs_set_bits(GUEST_CR0, X86_CR0_TS); update_exception_bitmap(vcpu); } @@ -447,13 +527,13 @@ static void vmx_fpu_deactivate(struct kv if (!vcpu->fpu_active) return; vcpu->fpu_active = 0; - vmcs_set_bits(GUEST_CR0, CR0_TS_MASK); + vmcs_set_bits(GUEST_CR0, X86_CR0_TS); update_exception_bitmap(vcpu); } static void vmx_vcpu_decache(struct kvm_vcpu *vcpu) { - vcpu_clear(vcpu); + vcpu_clear(to_vmx(vcpu)); } static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu) @@ -498,62 +578,73 @@ static void vmx_inject_gp(struct kvm_vcp INTR_INFO_VALID_MASK); } +static void vmx_inject_ud(struct kvm_vcpu *vcpu) +{ + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + UD_VECTOR | + INTR_TYPE_EXCEPTION | + INTR_INFO_VALID_MASK); +} + /* * Swap MSR entry in host/guest MSR entry array. */ -void move_msr_up(struct kvm_vcpu *vcpu, int from, int to) +#ifdef CONFIG_X86_64 +static void move_msr_up(struct vcpu_vmx *vmx, int from, int to) { - struct vmx_msr_entry tmp; - tmp = vcpu->guest_msrs[to]; - vcpu->guest_msrs[to] = vcpu->guest_msrs[from]; - vcpu->guest_msrs[from] = tmp; - tmp = vcpu->host_msrs[to]; - vcpu->host_msrs[to] = vcpu->host_msrs[from]; - vcpu->host_msrs[from] = tmp; + struct kvm_msr_entry tmp; + + tmp = vmx->guest_msrs[to]; + vmx->guest_msrs[to] = vmx->guest_msrs[from]; + vmx->guest_msrs[from] = tmp; + tmp = vmx->host_msrs[to]; + vmx->host_msrs[to] = vmx->host_msrs[from]; + vmx->host_msrs[from] = tmp; } +#endif /* * Set up the vmcs to automatically save and restore system * msrs. Don't touch the 64-bit msrs if the guest is in legacy * mode, as fiddling with msrs is very expensive. 
*/ -static void setup_msrs(struct kvm_vcpu *vcpu) +static void setup_msrs(struct vcpu_vmx *vmx) { int save_nmsrs; save_nmsrs = 0; #ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { + if (is_long_mode(&vmx->vcpu)) { int index; - index = __find_msr_index(vcpu, MSR_SYSCALL_MASK); + index = __find_msr_index(vmx, MSR_SYSCALL_MASK); if (index >= 0) - move_msr_up(vcpu, index, save_nmsrs++); - index = __find_msr_index(vcpu, MSR_LSTAR); + move_msr_up(vmx, index, save_nmsrs++); + index = __find_msr_index(vmx, MSR_LSTAR); if (index >= 0) - move_msr_up(vcpu, index, save_nmsrs++); - index = __find_msr_index(vcpu, MSR_CSTAR); + move_msr_up(vmx, index, save_nmsrs++); + index = __find_msr_index(vmx, MSR_CSTAR); if (index >= 0) - move_msr_up(vcpu, index, save_nmsrs++); - index = __find_msr_index(vcpu, MSR_KERNEL_GS_BASE); + move_msr_up(vmx, index, save_nmsrs++); + index = __find_msr_index(vmx, MSR_KERNEL_GS_BASE); if (index >= 0) - move_msr_up(vcpu, index, save_nmsrs++); + move_msr_up(vmx, index, save_nmsrs++); /* * MSR_K6_STAR is only needed on long mode guests, and only * if efer.sce is enabled. */ - index = __find_msr_index(vcpu, MSR_K6_STAR); - if ((index >= 0) && (vcpu->shadow_efer & EFER_SCE)) - move_msr_up(vcpu, index, save_nmsrs++); + index = __find_msr_index(vmx, MSR_K6_STAR); + if ((index >= 0) && (vmx->vcpu.shadow_efer & EFER_SCE)) + move_msr_up(vmx, index, save_nmsrs++); } #endif - vcpu->save_nmsrs = save_nmsrs; + vmx->save_nmsrs = save_nmsrs; #ifdef CONFIG_X86_64 - vcpu->msr_offset_kernel_gs_base = - __find_msr_index(vcpu, MSR_KERNEL_GS_BASE); + vmx->msr_offset_kernel_gs_base = + __find_msr_index(vmx, MSR_KERNEL_GS_BASE); #endif - vcpu->msr_offset_efer = __find_msr_index(vcpu, MSR_EFER); + vmx->msr_offset_efer = __find_msr_index(vmx, MSR_EFER); } /* @@ -589,7 +680,7 @@ static void guest_write_tsc(u64 guest_ts static int vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) { u64 data; - struct vmx_msr_entry *msr; + struct kvm_msr_entry *msr; if (!pdata) { printk(KERN_ERR "BUG: get_msr called with NULL pdata\n"); @@ -620,7 +711,7 @@ #endif data = vmcs_readl(GUEST_SYSENTER_ESP); break; default: - msr = find_msr_entry(vcpu, msr_index); + msr = find_msr_entry(to_vmx(vcpu), msr_index); if (msr) { data = msr->data; break; @@ -639,15 +730,18 @@ #endif */ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data) { - struct vmx_msr_entry *msr; + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_msr_entry *msr; int ret = 0; switch (msr_index) { #ifdef CONFIG_X86_64 case MSR_EFER: ret = kvm_set_msr_common(vcpu, msr_index, data); - if (vcpu->vmx_host_state.loaded) - load_transition_efer(vcpu); + if (vmx->host_state.loaded) { + reload_host_efer(vmx); + load_transition_efer(vmx); + } break; case MSR_FS_BASE: vmcs_writel(GUEST_FS_BASE, data); @@ -669,11 +763,11 @@ #endif guest_write_tsc(data); break; default: - msr = find_msr_entry(vcpu, msr_index); + msr = find_msr_entry(vmx, msr_index); if (msr) { msr->data = data; - if (vcpu->vmx_host_state.loaded) - load_msrs(vcpu->guest_msrs, vcpu->save_nmsrs); + if (vmx->host_state.loaded) + load_msrs(vmx->guest_msrs, vmx->save_nmsrs); break; } ret = kvm_set_msr_common(vcpu, msr_index, data); @@ -740,6 +834,20 @@ static int set_guest_debug(struct kvm_vc return 0; } +static int vmx_get_irq(struct kvm_vcpu *vcpu) +{ + u32 idtv_info_field; + + idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD); + if (idtv_info_field & INTR_INFO_VALID_MASK) { + if (is_external_interrupt(idtv_info_field)) + return idtv_info_field & VECTORING_INFO_VECTOR_MASK; + else 
+ printk("pending exception: not handled yet\n"); + } + return -1; +} + static __init int cpu_has_kvm_support(void) { unsigned long ecx = cpuid_ecx(1); @@ -751,7 +859,10 @@ static __init int vmx_disabled_by_bios(v u64 msr; rdmsrl(MSR_IA32_FEATURE_CONTROL, msr); - return (msr & 5) == 1; /* locked but not enabled */ + return (msr & (MSR_IA32_FEATURE_CONTROL_LOCKED | + MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED)) + == MSR_IA32_FEATURE_CONTROL_LOCKED; + /* locked but not enabled */ } static void hardware_enable(void *garbage) @@ -761,10 +872,15 @@ static void hardware_enable(void *garbag u64 old; rdmsrl(MSR_IA32_FEATURE_CONTROL, old); - if ((old & 5) != 5) + if ((old & (MSR_IA32_FEATURE_CONTROL_LOCKED | + MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED)) + != (MSR_IA32_FEATURE_CONTROL_LOCKED | + MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED)) /* enable and lock */ - wrmsrl(MSR_IA32_FEATURE_CONTROL, old | 5); - write_cr4(read_cr4() | CR4_VMXE); /* FIXME: not cpu hotplug safe */ + wrmsrl(MSR_IA32_FEATURE_CONTROL, old | + MSR_IA32_FEATURE_CONTROL_LOCKED | + MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED); + write_cr4(read_cr4() | X86_CR4_VMXE); /* FIXME: not cpu hotplug safe */ asm volatile (ASM_VMX_VMXON_RAX : : "a"(&phys_addr), "m"(phys_addr) : "memory", "cc"); } @@ -774,14 +890,102 @@ static void hardware_disable(void *garba asm volatile (ASM_VMX_VMXOFF : : : "cc"); } -static __init void setup_vmcs_descriptor(void) +static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt, + u32 msr, u32* result) { u32 vmx_msr_low, vmx_msr_high; + u32 ctl = ctl_min | ctl_opt; + + rdmsr(msr, vmx_msr_low, vmx_msr_high); + + ctl &= vmx_msr_high; /* bit == 0 in high word ==> must be zero */ + ctl |= vmx_msr_low; /* bit == 1 in low word ==> must be one */ + + /* Ensure minimum (required) set of control bits are supported. */ + if (ctl_min & ~ctl) + return -EIO; + + *result = ctl; + return 0; +} + +static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) +{ + u32 vmx_msr_low, vmx_msr_high; + u32 min, opt; + u32 _pin_based_exec_control = 0; + u32 _cpu_based_exec_control = 0; + u32 _vmexit_control = 0; + u32 _vmentry_control = 0; + + min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; + opt = 0; + if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, + &_pin_based_exec_control) < 0) + return -EIO; + + min = CPU_BASED_HLT_EXITING | +#ifdef CONFIG_X86_64 + CPU_BASED_CR8_LOAD_EXITING | + CPU_BASED_CR8_STORE_EXITING | +#endif + CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_MOV_DR_EXITING | + CPU_BASED_USE_TSC_OFFSETING; +#ifdef CONFIG_X86_64 + opt = CPU_BASED_TPR_SHADOW; +#else + opt = 0; +#endif + if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS, + &_cpu_based_exec_control) < 0) + return -EIO; +#ifdef CONFIG_X86_64 + if ((_cpu_based_exec_control & CPU_BASED_TPR_SHADOW)) + _cpu_based_exec_control &= ~CPU_BASED_CR8_LOAD_EXITING & + ~CPU_BASED_CR8_STORE_EXITING; +#endif + + min = 0; +#ifdef CONFIG_X86_64 + min |= VM_EXIT_HOST_ADDR_SPACE_SIZE; +#endif + opt = 0; + if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS, + &_vmexit_control) < 0) + return -EIO; + + min = opt = 0; + if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, + &_vmentry_control) < 0) + return -EIO; rdmsr(MSR_IA32_VMX_BASIC, vmx_msr_low, vmx_msr_high); - vmcs_descriptor.size = vmx_msr_high & 0x1fff; - vmcs_descriptor.order = get_order(vmcs_descriptor.size); - vmcs_descriptor.revision_id = vmx_msr_low; + + /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. 
 */
+	if ((vmx_msr_high & 0x1fff) > PAGE_SIZE)
+		return -EIO;
+
+#ifdef CONFIG_X86_64
+	/* IA-32 SDM Vol 3B: 64-bit CPUs always have VMX_BASIC_MSR[48]==0. */
+	if (vmx_msr_high & (1u<<16))
+		return -EIO;
+#endif
+
+	/* Require Write-Back (WB) memory type for VMCS accesses. */
+	if (((vmx_msr_high >> 18) & 15) != 6)
+		return -EIO;
+
+	vmcs_conf->size = vmx_msr_high & 0x1fff;
+	vmcs_conf->order = get_order(vmcs_conf->size);
+	vmcs_conf->revision_id = vmx_msr_low;
+
+	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
+	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
+	vmcs_conf->vmexit_ctrl = _vmexit_control;
+	vmcs_conf->vmentry_ctrl = _vmentry_control;
+
+	return 0;
 }
 
 static struct vmcs *alloc_vmcs_cpu(int cpu)
@@ -790,12 +994,12 @@ static struct vmcs *alloc_vmcs_cpu(int c
 	struct page *pages;
 	struct vmcs *vmcs;
 
-	pages = alloc_pages_node(node, GFP_KERNEL, vmcs_descriptor.order);
+	pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
 	if (!pages)
 		return NULL;
 	vmcs = page_address(pages);
-	memset(vmcs, 0, vmcs_descriptor.size);
-	vmcs->revision_id = vmcs_descriptor.revision_id; /* vmcs revision id */
+	memset(vmcs, 0, vmcs_config.size);
+	vmcs->revision_id = vmcs_config.revision_id; /* vmcs revision id */
 	return vmcs;
 }
 
@@ -806,7 +1010,7 @@ static struct vmcs *alloc_vmcs(void)
 
 static void free_vmcs(struct vmcs *vmcs)
 {
-	free_pages((unsigned long)vmcs, vmcs_descriptor.order);
+	free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
 static void free_kvm_area(void)
@@ -817,8 +1021,6 @@ static void free_kvm_area(void)
 		free_vmcs(per_cpu(vmxarea, cpu));
 }
 
-extern struct vmcs *alloc_vmcs_cpu(int cpu);
-
 static __init int alloc_kvm_area(void)
 {
 	int cpu;
@@ -839,7 +1041,8 @@ static __init int alloc_kvm_area(void)
 
 static __init int hardware_setup(void)
 {
-	setup_vmcs_descriptor();
+	if (setup_vmcs_config(&vmcs_config) < 0)
+		return -EIO;
 	return alloc_kvm_area();
 }
 
@@ -879,8 +1082,8 @@ static void enter_pmode(struct kvm_vcpu
 	flags |= (vcpu->rmode.save_iopl << IOPL_SHIFT);
 	vmcs_writel(GUEST_RFLAGS, flags);
 
-	vmcs_writel(GUEST_CR4, (vmcs_readl(GUEST_CR4) & ~CR4_VME_MASK) |
-			(vmcs_readl(CR4_READ_SHADOW) & CR4_VME_MASK));
+	vmcs_writel(GUEST_CR4, (vmcs_readl(GUEST_CR4) & ~X86_CR4_VME) |
+			(vmcs_readl(CR4_READ_SHADOW) & X86_CR4_VME));
 
 	update_exception_bitmap(vcpu);
 
@@ -897,7 +1100,7 @@ static void enter_pmode(struct kvm_vcpu
 	vmcs_write32(GUEST_CS_AR_BYTES, 0x9b);
 }
 
-static int rmode_tss_base(struct kvm* kvm)
+static gva_t rmode_tss_base(struct kvm* kvm)
 {
 	gfn_t base_gfn = kvm->memslots[0].base_gfn + kvm->memslots[0].npages - 3;
 	return base_gfn << PAGE_SHIFT;
@@ -937,7 +1140,7 @@ static void enter_rmode(struct kvm_vcpu
 	flags |= IOPL_MASK | X86_EFLAGS_VM;
 
 	vmcs_writel(GUEST_RFLAGS, flags);
-	vmcs_writel(GUEST_CR4, vmcs_readl(GUEST_CR4) | CR4_VME_MASK);
+	vmcs_writel(GUEST_CR4, vmcs_readl(GUEST_CR4) | X86_CR4_VME);
 	update_exception_bitmap(vcpu);
 
 	vmcs_write16(GUEST_SS_SELECTOR, vmcs_readl(GUEST_SS_BASE) >> 4);
@@ -975,10 +1178,10 @@ static void enter_lmode(struct kvm_vcpu
 
 	vcpu->shadow_efer |= EFER_LMA;
 
-	find_msr_entry(vcpu, MSR_EFER)->data |= EFER_LMA | EFER_LME;
+	find_msr_entry(to_vmx(vcpu), MSR_EFER)->data |= EFER_LMA | EFER_LME;
 	vmcs_write32(VM_ENTRY_CONTROLS,
 		     vmcs_read32(VM_ENTRY_CONTROLS)
-		     | VM_ENTRY_CONTROLS_IA32E_MASK);
+		     | VM_ENTRY_IA32E_MODE);
 }
 
 static void exit_lmode(struct kvm_vcpu *vcpu)
@@ -987,7 +1190,7 @@ static void exit_lmode(struct kvm_vcpu *
 	vmcs_write32(VM_ENTRY_CONTROLS,
 		     vmcs_read32(VM_ENTRY_CONTROLS)
-		     & ~VM_ENTRY_CONTROLS_IA32E_MASK);
+		     &
~VM_ENTRY_IA32E_MODE); } #endif @@ -1002,17 +1205,17 @@ static void vmx_set_cr0(struct kvm_vcpu { vmx_fpu_deactivate(vcpu); - if (vcpu->rmode.active && (cr0 & CR0_PE_MASK)) + if (vcpu->rmode.active && (cr0 & X86_CR0_PE)) enter_pmode(vcpu); - if (!vcpu->rmode.active && !(cr0 & CR0_PE_MASK)) + if (!vcpu->rmode.active && !(cr0 & X86_CR0_PE)) enter_rmode(vcpu); #ifdef CONFIG_X86_64 if (vcpu->shadow_efer & EFER_LME) { - if (!is_paging(vcpu) && (cr0 & CR0_PG_MASK)) + if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) enter_lmode(vcpu); - if (is_paging(vcpu) && !(cr0 & CR0_PG_MASK)) + if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) exit_lmode(vcpu); } #endif @@ -1022,14 +1225,14 @@ #endif (cr0 & ~KVM_GUEST_CR0_MASK) | KVM_VM_CR0_ALWAYS_ON); vcpu->cr0 = cr0; - if (!(cr0 & CR0_TS_MASK) || !(cr0 & CR0_PE_MASK)) + if (!(cr0 & X86_CR0_TS) || !(cr0 & X86_CR0_PE)) vmx_fpu_activate(vcpu); } static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) { vmcs_writel(GUEST_CR3, cr3); - if (vcpu->cr0 & CR0_PE_MASK) + if (vcpu->cr0 & X86_CR0_PE) vmx_fpu_deactivate(vcpu); } @@ -1045,23 +1248,24 @@ #ifdef CONFIG_X86_64 static void vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer) { - struct vmx_msr_entry *msr = find_msr_entry(vcpu, MSR_EFER); + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_msr_entry *msr = find_msr_entry(vmx, MSR_EFER); vcpu->shadow_efer = efer; if (efer & EFER_LMA) { vmcs_write32(VM_ENTRY_CONTROLS, vmcs_read32(VM_ENTRY_CONTROLS) | - VM_ENTRY_CONTROLS_IA32E_MASK); + VM_ENTRY_IA32E_MODE); msr->data = efer; } else { vmcs_write32(VM_ENTRY_CONTROLS, vmcs_read32(VM_ENTRY_CONTROLS) & - ~VM_ENTRY_CONTROLS_IA32E_MASK); + ~VM_ENTRY_IA32E_MODE); msr->data = efer & ~EFER_LME; } - setup_msrs(vcpu); + setup_msrs(vmx); } #endif @@ -1210,17 +1414,6 @@ static int init_rmode_tss(struct kvm* kv return 1; } -static void vmcs_write32_fixedbits(u32 msr, u32 vmcs_field, u32 val) -{ - u32 msr_high, msr_low; - - rdmsr(msr, msr_low, msr_high); - - val &= msr_high; - val |= msr_low; - vmcs_write32(vmcs_field, val); -} - static void seg_setup(int seg) { struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; @@ -1234,7 +1427,7 @@ static void seg_setup(int seg) /* * Sets up the vmcs for emulated real mode. */ -static int vmx_vcpu_setup(struct kvm_vcpu *vcpu) +static int vmx_vcpu_setup(struct vcpu_vmx *vmx) { u32 host_sysenter_cs; u32 junk; @@ -1243,27 +1436,36 @@ static int vmx_vcpu_setup(struct kvm_vcp int i; int ret = 0; unsigned long kvm_vmx_return; + u64 msr; + u32 exec_control; - if (!init_rmode_tss(vcpu->kvm)) { + if (!init_rmode_tss(vmx->vcpu.kvm)) { ret = -ENOMEM; goto out; } - memset(vcpu->regs, 0, sizeof(vcpu->regs)); - vcpu->regs[VCPU_REGS_RDX] = get_rdx_init_val(); - vcpu->cr8 = 0; - vcpu->apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; - if (vcpu == &vcpu->kvm->vcpus[0]) - vcpu->apic_base |= MSR_IA32_APICBASE_BSP; + vmx->vcpu.rmode.active = 0; - fx_init(vcpu); + vmx->vcpu.regs[VCPU_REGS_RDX] = get_rdx_init_val(); + set_cr8(&vmx->vcpu, 0); + msr = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; + if (vmx->vcpu.vcpu_id == 0) + msr |= MSR_IA32_APICBASE_BSP; + kvm_set_apic_base(&vmx->vcpu, msr); + + fx_init(&vmx->vcpu); /* * GUEST_CS_BASE should really be 0xffff0000, but VT vm86 mode * insists on having GUEST_CS_BASE == GUEST_CS_SELECTOR << 4. Sigh. 
*/ - vmcs_write16(GUEST_CS_SELECTOR, 0xf000); - vmcs_writel(GUEST_CS_BASE, 0x000f0000); + if (vmx->vcpu.vcpu_id == 0) { + vmcs_write16(GUEST_CS_SELECTOR, 0xf000); + vmcs_writel(GUEST_CS_BASE, 0x000f0000); + } else { + vmcs_write16(GUEST_CS_SELECTOR, vmx->vcpu.sipi_vector << 8); + vmcs_writel(GUEST_CS_BASE, vmx->vcpu.sipi_vector << 12); + } vmcs_write32(GUEST_CS_LIMIT, 0xffff); vmcs_write32(GUEST_CS_AR_BYTES, 0x9b); @@ -1288,7 +1490,10 @@ static int vmx_vcpu_setup(struct kvm_vcp vmcs_writel(GUEST_SYSENTER_EIP, 0); vmcs_writel(GUEST_RFLAGS, 0x02); - vmcs_writel(GUEST_RIP, 0xfff0); + if (vmx->vcpu.vcpu_id == 0) + vmcs_writel(GUEST_RIP, 0xfff0); + else + vmcs_writel(GUEST_RIP, 0); vmcs_writel(GUEST_RSP, 0); //todo: dr0 = dr1 = dr2 = dr3 = 0; dr6 = 0xffff0ff0 @@ -1316,23 +1521,21 @@ static int vmx_vcpu_setup(struct kvm_vcp vmcs_write64(GUEST_IA32_DEBUGCTL, 0); /* Control */ - vmcs_write32_fixedbits(MSR_IA32_VMX_PINBASED_CTLS, - PIN_BASED_VM_EXEC_CONTROL, - PIN_BASED_EXT_INTR_MASK /* 20.6.1 */ - | PIN_BASED_NMI_EXITING /* 20.6.1 */ - ); - vmcs_write32_fixedbits(MSR_IA32_VMX_PROCBASED_CTLS, - CPU_BASED_VM_EXEC_CONTROL, - CPU_BASED_HLT_EXITING /* 20.6.2 */ - | CPU_BASED_CR8_LOAD_EXITING /* 20.6.2 */ - | CPU_BASED_CR8_STORE_EXITING /* 20.6.2 */ - | CPU_BASED_ACTIVATE_IO_BITMAP /* 20.6.2 */ - | CPU_BASED_MOV_DR_EXITING - | CPU_BASED_USE_TSC_OFFSETING /* 21.3 */ - ); - - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0); + vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, + vmcs_config.pin_based_exec_ctrl); + + exec_control = vmcs_config.cpu_based_exec_ctrl; + if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) { + exec_control &= ~CPU_BASED_TPR_SHADOW; +#ifdef CONFIG_X86_64 + exec_control |= CPU_BASED_CR8_STORE_EXITING | + CPU_BASED_CR8_LOAD_EXITING; +#endif + } + vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); + + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, !!bypass_guest_pf); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf); vmcs_write32(CR3_TARGET_COUNT, 0); /* 22.2.1 */ vmcs_writel(HOST_CR0, read_cr0()); /* 22.2.3 */ @@ -1377,46 +1580,48 @@ #endif u32 index = vmx_msr_index[i]; u32 data_low, data_high; u64 data; - int j = vcpu->nmsrs; + int j = vmx->nmsrs; if (rdmsr_safe(index, &data_low, &data_high) < 0) continue; if (wrmsr_safe(index, data_low, data_high) < 0) continue; data = data_low | ((u64)data_high << 32); - vcpu->host_msrs[j].index = index; - vcpu->host_msrs[j].reserved = 0; - vcpu->host_msrs[j].data = data; - vcpu->guest_msrs[j] = vcpu->host_msrs[j]; - ++vcpu->nmsrs; + vmx->host_msrs[j].index = index; + vmx->host_msrs[j].reserved = 0; + vmx->host_msrs[j].data = data; + vmx->guest_msrs[j] = vmx->host_msrs[j]; + ++vmx->nmsrs; } - setup_msrs(vcpu); + setup_msrs(vmx); - vmcs_write32_fixedbits(MSR_IA32_VMX_EXIT_CTLS, VM_EXIT_CONTROLS, - (HOST_IS_64 << 9)); /* 22.2,1, 20.7.1 */ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); /* 22.2.1, 20.8.1 */ - vmcs_write32_fixedbits(MSR_IA32_VMX_ENTRY_CTLS, - VM_ENTRY_CONTROLS, 0); + vmcs_write32(VM_ENTRY_CONTROLS, vmcs_config.vmentry_ctrl); + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */ #ifdef CONFIG_X86_64 - vmcs_writel(VIRTUAL_APIC_PAGE_ADDR, 0); - vmcs_writel(TPR_THRESHOLD, 0); + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0); + if (vm_need_tpr_shadow(vmx->vcpu.kvm)) + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, + page_to_phys(vmx->vcpu.apic->regs_page)); + vmcs_write32(TPR_THRESHOLD, 0); #endif vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL); vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK); - vcpu->cr0 = 
0x60000010; - vmx_set_cr0(vcpu, vcpu->cr0); // enter rmode - vmx_set_cr4(vcpu, 0); + vmx->vcpu.cr0 = 0x60000010; + vmx_set_cr0(&vmx->vcpu, vmx->vcpu.cr0); // enter rmode + vmx_set_cr4(&vmx->vcpu, 0); #ifdef CONFIG_X86_64 - vmx_set_efer(vcpu, 0); + vmx_set_efer(&vmx->vcpu, 0); #endif - vmx_fpu_activate(vcpu); - update_exception_bitmap(vcpu); + vmx_fpu_activate(&vmx->vcpu); + update_exception_bitmap(&vmx->vcpu); return 0; @@ -1424,6 +1629,13 @@ out: return ret; } +static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + vmx_vcpu_setup(vmx); +} + static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq) { u16 ent[2]; @@ -1443,8 +1655,8 @@ static void inject_rmode_irq(struct kvm_ return; } - if (kvm_read_guest(vcpu, irq * sizeof(ent), sizeof(ent), &ent) != - sizeof(ent)) { + if (emulator_read_std(irq * sizeof(ent), &ent, sizeof(ent), vcpu) != + X86EMUL_CONTINUE) { vcpu_printf(vcpu, "%s: read guest err\n", __FUNCTION__); return; } @@ -1454,9 +1666,9 @@ static void inject_rmode_irq(struct kvm_ ip = vmcs_readl(GUEST_RIP); - if (kvm_write_guest(vcpu, ss_base + sp - 2, 2, &flags) != 2 || - kvm_write_guest(vcpu, ss_base + sp - 4, 2, &cs) != 2 || - kvm_write_guest(vcpu, ss_base + sp - 6, 2, &ip) != 2) { + if (emulator_write_emulated(ss_base + sp - 2, &flags, 2, vcpu) != X86EMUL_CONTINUE || + emulator_write_emulated(ss_base + sp - 4, &cs, 2, vcpu) != X86EMUL_CONTINUE || + emulator_write_emulated(ss_base + sp - 6, &ip, 2, vcpu) != X86EMUL_CONTINUE) { vcpu_printf(vcpu, "%s: write guest err\n", __FUNCTION__); return; } @@ -1469,6 +1681,16 @@ static void inject_rmode_irq(struct kvm_ vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6)); } +static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq) +{ + if (vcpu->rmode.active) { + inject_rmode_irq(vcpu, irq); + return; + } + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); +} + static void kvm_do_inject_irq(struct kvm_vcpu *vcpu) { int word_index = __ffs(vcpu->irq_summary); @@ -1478,13 +1700,7 @@ static void kvm_do_inject_irq(struct kvm clear_bit(bit_index, &vcpu->irq_pending[word_index]); if (!vcpu->irq_pending[word_index]) clear_bit(word_index, &vcpu->irq_summary); - - if (vcpu->rmode.active) { - inject_rmode_irq(vcpu, irq); - return; - } - vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, - irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); + vmx_inject_irq(vcpu, irq); } @@ -1546,7 +1762,7 @@ static int handle_rmode_exception(struct * Cause the #SS fault with 0 error code in VM86 mode. 
*/ if (((vec == GP_VECTOR) || (vec == SS_VECTOR)) && err_code == 0) - if (emulate_instruction(vcpu, NULL, 0, 0) == EMULATE_DONE) + if (emulate_instruction(vcpu, NULL, 0, 0, 0) == EMULATE_DONE) return 1; return 0; } @@ -1568,7 +1784,7 @@ static int handle_exception(struct kvm_v "intr info 0x%x\n", __FUNCTION__, vect_info, intr_info); } - if (is_external_interrupt(vect_info)) { + if (!irqchip_in_kernel(vcpu->kvm) && is_external_interrupt(vect_info)) { int irq = vect_info & VECTORING_INFO_VECTOR_MASK; set_bit(irq, vcpu->irq_pending); set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary); @@ -1584,6 +1800,14 @@ static int handle_exception(struct kvm_v return 1; } + if (is_invalid_opcode(intr_info)) { + er = emulate_instruction(vcpu, kvm_run, 0, 0, 0); + if (er != EMULATE_DONE) + vmx_inject_ud(vcpu); + + return 1; + } + error_code = 0; rip = vmcs_readl(GUEST_RIP); if (intr_info & INTR_INFO_DELIEVER_CODE_MASK) @@ -1591,29 +1815,28 @@ static int handle_exception(struct kvm_v if (is_page_fault(intr_info)) { cr2 = vmcs_readl(EXIT_QUALIFICATION); - spin_lock(&vcpu->kvm->lock); + mutex_lock(&vcpu->kvm->lock); r = kvm_mmu_page_fault(vcpu, cr2, error_code); if (r < 0) { - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); return r; } if (!r) { - spin_unlock(&vcpu->kvm->lock); + mutex_unlock(&vcpu->kvm->lock); return 1; } - er = emulate_instruction(vcpu, kvm_run, cr2, error_code); - spin_unlock(&vcpu->kvm->lock); + er = emulate_instruction(vcpu, kvm_run, cr2, error_code, 0); + mutex_unlock(&vcpu->kvm->lock); switch (er) { case EMULATE_DONE: return 1; case EMULATE_DO_MMIO: ++vcpu->stat.mmio_exits; - kvm_run->exit_reason = KVM_EXIT_MMIO; return 0; case EMULATE_FAIL: - vcpu_printf(vcpu, "%s: emulate fail\n", __FUNCTION__); + kvm_report_emulation_failure(vcpu, "pagetable"); break; default: BUG(); @@ -1653,80 +1876,30 @@ static int handle_triple_fault(struct kv return 0; } -static int get_io_count(struct kvm_vcpu *vcpu, unsigned long *count) -{ - u64 inst; - gva_t rip; - int countr_size; - int i, n; - - if ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_VM)) { - countr_size = 2; - } else { - u32 cs_ar = vmcs_read32(GUEST_CS_AR_BYTES); - - countr_size = (cs_ar & AR_L_MASK) ? 8: - (cs_ar & AR_DB_MASK) ? 4: 2; - } - - rip = vmcs_readl(GUEST_RIP); - if (countr_size != 8) - rip += vmcs_readl(GUEST_CS_BASE); - - n = kvm_read_guest(vcpu, rip, sizeof(inst), &inst); - - for (i = 0; i < n; i++) { - switch (((u8*)&inst)[i]) { - case 0xf0: - case 0xf2: - case 0xf3: - case 0x2e: - case 0x36: - case 0x3e: - case 0x26: - case 0x64: - case 0x65: - case 0x66: - break; - case 0x67: - countr_size = (countr_size == 2) ? 
4: (countr_size >> 1); - default: - goto done; - } - } - return 0; -done: - countr_size *= 8; - *count = vcpu->regs[VCPU_REGS_RCX] & (~0ULL >> (64 - countr_size)); - //printk("cx: %lx\n", vcpu->regs[VCPU_REGS_RCX]); - return 1; -} - static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { - u64 exit_qualification; + unsigned long exit_qualification; int size, down, in, string, rep; unsigned port; - unsigned long count; - gva_t address; ++vcpu->stat.io_exits; - exit_qualification = vmcs_read64(EXIT_QUALIFICATION); - in = (exit_qualification & 8) != 0; - size = (exit_qualification & 7) + 1; + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); string = (exit_qualification & 16) != 0; + + if (string) { + if (emulate_instruction(vcpu, + kvm_run, 0, 0, 0) == EMULATE_DO_MMIO) + return 0; + return 1; + } + + size = (exit_qualification & 7) + 1; + in = (exit_qualification & 8) != 0; down = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_DF) != 0; - count = 1; rep = (exit_qualification & 32) != 0; port = exit_qualification >> 16; - address = 0; - if (string) { - if (rep && !get_io_count(vcpu, &count)) - return 1; - address = vmcs_readl(GUEST_LINEAR_ADDRESS); - } - return kvm_setup_pio(vcpu, kvm_run, in, size, count, string, down, - address, rep, port); + + return kvm_emulate_pio(vcpu, kvm_run, in, size, port); } static void @@ -1738,16 +1911,15 @@ vmx_patch_hypercall(struct kvm_vcpu *vcp hypercall[0] = 0x0f; hypercall[1] = 0x01; hypercall[2] = 0xc1; - hypercall[3] = 0xc3; } static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { - u64 exit_qualification; + unsigned long exit_qualification; int cr; int reg; - exit_qualification = vmcs_read64(EXIT_QUALIFICATION); + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); cr = exit_qualification & 15; reg = (exit_qualification >> 8) & 15; switch ((exit_qualification >> 4) & 3) { @@ -1772,13 +1944,14 @@ static int handle_cr(struct kvm_vcpu *vc vcpu_load_rsp_rip(vcpu); set_cr8(vcpu, vcpu->regs[reg]); skip_emulated_instruction(vcpu); - return 1; + kvm_run->exit_reason = KVM_EXIT_SET_TPR; + return 0; }; break; case 2: /* clts */ vcpu_load_rsp_rip(vcpu); vmx_fpu_deactivate(vcpu); - vcpu->cr0 &= ~CR0_TS_MASK; + vcpu->cr0 &= ~X86_CR0_TS; vmcs_writel(CR0_READ_SHADOW, vcpu->cr0); vmx_fpu_activate(vcpu); skip_emulated_instruction(vcpu); @@ -1793,7 +1966,7 @@ static int handle_cr(struct kvm_vcpu *vc return 1; case 8: vcpu_load_rsp_rip(vcpu); - vcpu->regs[reg] = vcpu->cr8; + vcpu->regs[reg] = get_cr8(vcpu); vcpu_put_rsp_rip(vcpu); skip_emulated_instruction(vcpu); return 1; @@ -1808,14 +1981,14 @@ static int handle_cr(struct kvm_vcpu *vc break; } kvm_run->exit_reason = 0; - printk(KERN_ERR "kvm: unhandled control register: op %d cr %d\n", + pr_unimpl(vcpu, "unhandled control register: op %d cr %d\n", (int)(exit_qualification >> 4) & 3, cr); return 0; } static int handle_dr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { - u64 exit_qualification; + unsigned long exit_qualification; unsigned long val; int dr, reg; @@ -1823,7 +1996,7 @@ static int handle_dr(struct kvm_vcpu *vc * FIXME: this code assumes the host is debugging the guest. * need to deal with guest debugging itself too. 
*/ - exit_qualification = vmcs_read64(EXIT_QUALIFICATION); + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); dr = exit_qualification & 7; reg = (exit_qualification >> 8) & 15; vcpu_load_rsp_rip(vcpu); @@ -1886,19 +2059,21 @@ static int handle_wrmsr(struct kvm_vcpu return 1; } -static void post_kvm_run_save(struct kvm_vcpu *vcpu, - struct kvm_run *kvm_run) +static int handle_tpr_below_threshold(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) { - kvm_run->if_flag = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) != 0; - kvm_run->cr8 = vcpu->cr8; - kvm_run->apic_base = vcpu->apic_base; - kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open && - vcpu->irq_summary == 0); + return 1; } static int handle_interrupt_window(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { + u32 cpu_based_vm_exec_control; + + /* clear pending irq */ + cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); + cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING; + vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control); /* * If the user space waits to inject interrupts, exit as soon as * possible @@ -1921,7 +2096,8 @@ static int handle_halt(struct kvm_vcpu * static int handle_vmcall(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { skip_emulated_instruction(vcpu); - return kvm_hypercall(vcpu, kvm_run); + kvm_emulate_hypercall(vcpu); + return 1; } /* @@ -1943,6 +2119,7 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_PENDING_INTERRUPT] = handle_interrupt_window, [EXIT_REASON_HLT] = handle_halt, [EXIT_REASON_VMCALL] = handle_vmcall, + [EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold }; static const int kvm_vmx_max_exit_handlers = @@ -1956,6 +2133,14 @@ static int kvm_handle_exit(struct kvm_ru { u32 vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); u32 exit_reason = vmcs_read32(VM_EXIT_REASON); + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (unlikely(vmx->fail)) { + kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; + kvm_run->fail_entry.hardware_entry_failure_reason + = vmcs_read32(VM_INSTRUCTION_ERROR); + return 0; + } if ( (vectoring_info & VECTORING_INFO_VALID_MASK) && exit_reason != EXIT_REASON_EXCEPTION_NMI ) @@ -1971,57 +2156,91 @@ static int kvm_handle_exit(struct kvm_ru return 0; } -/* - * Check if userspace requested an interrupt window, and that the - * interrupt window is open. - * - * No need to exit to userspace if we already have an interrupt queued. - */ -static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu, - struct kvm_run *kvm_run) +static void vmx_flush_tlb(struct kvm_vcpu *vcpu) { - return (!vcpu->irq_summary && - kvm_run->request_interrupt_window && - vcpu->interrupt_window_open && - (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)); } -static void vmx_flush_tlb(struct kvm_vcpu *vcpu) +static void update_tpr_threshold(struct kvm_vcpu *vcpu) { + int max_irr, tpr; + + if (!vm_need_tpr_shadow(vcpu->kvm)) + return; + + if (!kvm_lapic_enabled(vcpu) || + ((max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1)) { + vmcs_write32(TPR_THRESHOLD, 0); + return; + } + + tpr = (kvm_lapic_get_cr8(vcpu) & 0x0f) << 4; + vmcs_write32(TPR_THRESHOLD, (max_irr > tpr) ? 
tpr >> 4 : max_irr >> 4); } -static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +static void enable_irq_window(struct kvm_vcpu *vcpu) { - u8 fail; - int r; + u32 cpu_based_vm_exec_control; -preempted: - if (vcpu->guest_debug.enabled) - kvm_guest_debug_pre(vcpu); + cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); + cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING; + vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control); +} -again: - if (!vcpu->mmio_read_completed) - do_interrupt_requests(vcpu, kvm_run); +static void vmx_intr_assist(struct kvm_vcpu *vcpu) +{ + u32 idtv_info_field, intr_info_field; + int has_ext_irq, interrupt_window_open; + int vector; - vmx_save_host_state(vcpu); - kvm_load_guest_fpu(vcpu); + kvm_inject_pending_timer_irqs(vcpu); + update_tpr_threshold(vcpu); - r = kvm_mmu_reload(vcpu); - if (unlikely(r)) - goto out; + has_ext_irq = kvm_cpu_has_interrupt(vcpu); + intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); + idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD); + if (intr_info_field & INTR_INFO_VALID_MASK) { + if (idtv_info_field & INTR_INFO_VALID_MASK) { + /* TODO: fault when IDT_Vectoring */ + printk(KERN_ERR "Fault when IDT_Vectoring\n"); + } + if (has_ext_irq) + enable_irq_window(vcpu); + return; + } + if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) { + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field); + vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, + vmcs_read32(VM_EXIT_INSTRUCTION_LEN)); + + if (unlikely(idtv_info_field & INTR_INFO_DELIEVER_CODE_MASK)) + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, + vmcs_read32(IDT_VECTORING_ERROR_CODE)); + if (unlikely(has_ext_irq)) + enable_irq_window(vcpu); + return; + } + if (!has_ext_irq) + return; + interrupt_window_open = + ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && + (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0); + if (interrupt_window_open) { + vector = kvm_cpu_get_interrupt(vcpu); + vmx_inject_irq(vcpu, vector); + kvm_timer_intr_post(vcpu, vector); + } else + enable_irq_window(vcpu); +} + +static void vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); /* * Loading guest fpu may have cleared host cr0.ts */ vmcs_writel(HOST_CR0, read_cr0()); - local_irq_disable(); - - vcpu->guest_mode = 1; - if (vcpu->requests) - if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) - vmx_flush_tlb(vcpu); - asm ( /* Store host registers */ #ifdef CONFIG_X86_64 @@ -2115,8 +2334,8 @@ #else "pop %%ecx; popa \n\t" #endif "setbe %0 \n\t" - : "=q" (fail) - : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP), + : "=q" (vmx->fail) + : "r"(vmx->launched), "d"((unsigned long)HOST_RSP), "c"(vcpu), [rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])), [rbx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBX])), @@ -2138,59 +2357,10 @@ #endif [cr2]"i"(offsetof(struct kvm_vcpu, cr2)) : "cc", "memory" ); - vcpu->guest_mode = 0; - local_irq_enable(); - - ++vcpu->stat.exits; - vcpu->interrupt_window_open = (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0; asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS)); - - if (unlikely(fail)) { - kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; - kvm_run->fail_entry.hardware_entry_failure_reason - = vmcs_read32(VM_INSTRUCTION_ERROR); - r = 0; - goto out; - } - /* - * Profile KVM exit RIPs: - */ - if (unlikely(prof_on == KVM_PROFILING)) - profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP)); - - vcpu->launched = 1; - r = kvm_handle_exit(kvm_run, vcpu); - if (r > 0) { - 
/* Give scheduler a change to reschedule. */ - if (signal_pending(current)) { - r = -EINTR; - kvm_run->exit_reason = KVM_EXIT_INTR; - ++vcpu->stat.signal_exits; - goto out; - } - - if (dm_request_for_irq_injection(vcpu, kvm_run)) { - r = -EINTR; - kvm_run->exit_reason = KVM_EXIT_INTR; - ++vcpu->stat.request_irq_exits; - goto out; - } - if (!need_resched()) { - ++vcpu->stat.light_exits; - goto again; - } - } - -out: - if (r > 0) { - kvm_resched(vcpu); - goto preempted; - } - - post_kvm_run_save(vcpu, kvm_run); - return r; + vmx->launched = 1; } static void vmx_inject_page_fault(struct kvm_vcpu *vcpu, @@ -2225,67 +2395,118 @@ static void vmx_inject_page_fault(struct static void vmx_free_vmcs(struct kvm_vcpu *vcpu) { - if (vcpu->vmcs) { - on_each_cpu(__vcpu_clear, vcpu, 0, 1); - free_vmcs(vcpu->vmcs); - vcpu->vmcs = NULL; + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (vmx->vmcs) { + on_each_cpu(__vcpu_clear, vmx, 0, 1); + free_vmcs(vmx->vmcs); + vmx->vmcs = NULL; } } static void vmx_free_vcpu(struct kvm_vcpu *vcpu) { + struct vcpu_vmx *vmx = to_vmx(vcpu); + vmx_free_vmcs(vcpu); + kfree(vmx->host_msrs); + kfree(vmx->guest_msrs); + kvm_vcpu_uninit(vcpu); + kmem_cache_free(kvm_vcpu_cache, vmx); } -static int vmx_create_vcpu(struct kvm_vcpu *vcpu) +static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) { - struct vmcs *vmcs; + int err; + struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); + int cpu; - vcpu->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); - if (!vcpu->guest_msrs) - return -ENOMEM; + if (!vmx) + return ERR_PTR(-ENOMEM); - vcpu->host_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); - if (!vcpu->host_msrs) - goto out_free_guest_msrs; + err = kvm_vcpu_init(&vmx->vcpu, kvm, id); + if (err) + goto free_vcpu; - vmcs = alloc_vmcs(); - if (!vmcs) - goto out_free_msrs; + if (irqchip_in_kernel(kvm)) { + err = kvm_create_lapic(&vmx->vcpu); + if (err < 0) + goto free_vcpu; + } - vmcs_clear(vmcs); - vcpu->vmcs = vmcs; - vcpu->launched = 0; + vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (!vmx->guest_msrs) { + err = -ENOMEM; + goto uninit_vcpu; + } - return 0; + vmx->host_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (!vmx->host_msrs) + goto free_guest_msrs; -out_free_msrs: - kfree(vcpu->host_msrs); - vcpu->host_msrs = NULL; + vmx->vmcs = alloc_vmcs(); + if (!vmx->vmcs) + goto free_msrs; -out_free_guest_msrs: - kfree(vcpu->guest_msrs); - vcpu->guest_msrs = NULL; + vmcs_clear(vmx->vmcs); - return -ENOMEM; + cpu = get_cpu(); + vmx_vcpu_load(&vmx->vcpu, cpu); + err = vmx_vcpu_setup(vmx); + vmx_vcpu_put(&vmx->vcpu); + put_cpu(); + if (err) + goto free_vmcs; + + return &vmx->vcpu; + +free_vmcs: + free_vmcs(vmx->vmcs); +free_msrs: + kfree(vmx->host_msrs); +free_guest_msrs: + kfree(vmx->guest_msrs); +uninit_vcpu: + kvm_vcpu_uninit(&vmx->vcpu); +free_vcpu: + kmem_cache_free(kvm_vcpu_cache, vmx); + return ERR_PTR(err); +} + +static void __init vmx_check_processor_compat(void *rtn) +{ + struct vmcs_config vmcs_conf; + + *(int *)rtn = 0; + if (setup_vmcs_config(&vmcs_conf) < 0) + *(int *)rtn = -EIO; + if (memcmp(&vmcs_config, &vmcs_conf, sizeof(struct vmcs_config)) != 0) { + printk(KERN_ERR "kvm: CPU %d feature inconsistency!\n", + smp_processor_id()); + *(int *)rtn = -EIO; + } } -static struct kvm_arch_ops vmx_arch_ops = { +static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, .hardware_setup = hardware_setup, .hardware_unsetup = hardware_unsetup, + .check_processor_compatibility = 
vmx_check_processor_compat, .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, .vcpu_create = vmx_create_vcpu, .vcpu_free = vmx_free_vcpu, + .vcpu_reset = vmx_vcpu_reset, + .prepare_guest_switch = vmx_save_host_state, .vcpu_load = vmx_vcpu_load, .vcpu_put = vmx_vcpu_put, .vcpu_decache = vmx_vcpu_decache, .set_guest_debug = set_guest_debug, + .guest_debug_pre = kvm_guest_debug_pre, .get_msr = vmx_get_msr, .set_msr = vmx_set_msr, .get_segment_base = vmx_get_segment_base, @@ -2314,9 +2535,13 @@ #endif .inject_gp = vmx_inject_gp, .run = vmx_vcpu_run, + .handle_exit = kvm_handle_exit, .skip_emulated_instruction = skip_emulated_instruction, - .vcpu_setup = vmx_vcpu_setup, .patch_hypercall = vmx_patch_hypercall, + .get_irq = vmx_get_irq, + .set_irq = vmx_inject_irq, + .inject_pending_irq = vmx_intr_assist, + .inject_pending_vectors = do_interrupt_requests, }; static int __init vmx_init(void) @@ -2347,10 +2572,13 @@ static int __init vmx_init(void) memset(iova, 0xff, PAGE_SIZE); kunmap(vmx_io_bitmap_b); - r = kvm_init_arch(&vmx_arch_ops, THIS_MODULE); + r = kvm_init_x86(&vmx_x86_ops, sizeof(struct vcpu_vmx), THIS_MODULE); if (r) goto out1; + if (bypass_guest_pf) + kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull); + return 0; out1: @@ -2365,7 +2593,7 @@ static void __exit vmx_exit(void) __free_page(vmx_io_bitmap_b); __free_page(vmx_io_bitmap_a); - kvm_exit_arch(); + kvm_exit_x86(); } module_init(vmx_init) diff --git a/drivers/kvm/vmx.h b/drivers/kvm/vmx.h index d0dc93d..fd4e146 100644 --- a/drivers/kvm/vmx.h +++ b/drivers/kvm/vmx.h @@ -25,29 +25,36 @@ #define VMX_H * */ -#define CPU_BASED_VIRTUAL_INTR_PENDING 0x00000004 -#define CPU_BASED_USE_TSC_OFFSETING 0x00000008 -#define CPU_BASED_HLT_EXITING 0x00000080 -#define CPU_BASED_INVDPG_EXITING 0x00000200 -#define CPU_BASED_MWAIT_EXITING 0x00000400 -#define CPU_BASED_RDPMC_EXITING 0x00000800 -#define CPU_BASED_RDTSC_EXITING 0x00001000 -#define CPU_BASED_CR8_LOAD_EXITING 0x00080000 -#define CPU_BASED_CR8_STORE_EXITING 0x00100000 -#define CPU_BASED_TPR_SHADOW 0x00200000 -#define CPU_BASED_MOV_DR_EXITING 0x00800000 -#define CPU_BASED_UNCOND_IO_EXITING 0x01000000 -#define CPU_BASED_ACTIVATE_IO_BITMAP 0x02000000 -#define CPU_BASED_MSR_BITMAPS 0x10000000 -#define CPU_BASED_MONITOR_EXITING 0x20000000 -#define CPU_BASED_PAUSE_EXITING 0x40000000 +#define CPU_BASED_VIRTUAL_INTR_PENDING 0x00000004 +#define CPU_BASED_USE_TSC_OFFSETING 0x00000008 +#define CPU_BASED_HLT_EXITING 0x00000080 +#define CPU_BASED_INVLPG_EXITING 0x00000200 +#define CPU_BASED_MWAIT_EXITING 0x00000400 +#define CPU_BASED_RDPMC_EXITING 0x00000800 +#define CPU_BASED_RDTSC_EXITING 0x00001000 +#define CPU_BASED_CR8_LOAD_EXITING 0x00080000 +#define CPU_BASED_CR8_STORE_EXITING 0x00100000 +#define CPU_BASED_TPR_SHADOW 0x00200000 +#define CPU_BASED_MOV_DR_EXITING 0x00800000 +#define CPU_BASED_UNCOND_IO_EXITING 0x01000000 +#define CPU_BASED_USE_IO_BITMAPS 0x02000000 +#define CPU_BASED_USE_MSR_BITMAPS 0x10000000 +#define CPU_BASED_MONITOR_EXITING 0x20000000 +#define CPU_BASED_PAUSE_EXITING 0x40000000 +#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS 0x80000000 -#define PIN_BASED_EXT_INTR_MASK 0x1 -#define PIN_BASED_NMI_EXITING 0x8 +#define PIN_BASED_EXT_INTR_MASK 0x00000001 +#define PIN_BASED_NMI_EXITING 0x00000008 +#define PIN_BASED_VIRTUAL_NMIS 0x00000020 -#define VM_EXIT_ACK_INTR_ON_EXIT 0x00008000 -#define VM_EXIT_HOST_ADD_SPACE_SIZE 0x00000200 +#define VM_EXIT_HOST_ADDR_SPACE_SIZE 0x00000200 +#define VM_EXIT_ACK_INTR_ON_EXIT 0x00008000 +#define VM_ENTRY_IA32E_MODE 0x00000200 
+#define VM_ENTRY_SMM 0x00000400 +#define VM_ENTRY_DEACT_DUAL_MONITOR 0x00000800 + +#define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001 /* VMCS Encodings */ enum vmcs_field { @@ -206,6 +213,7 @@ #define EXIT_REASON_IO_INSTRUCTION #define EXIT_REASON_MSR_READ 31 #define EXIT_REASON_MSR_WRITE 32 #define EXIT_REASON_MWAIT_INSTRUCTION 36 +#define EXIT_REASON_TPR_BELOW_THRESHOLD 43 /* * Interruption-information format @@ -261,9 +269,6 @@ #define DEBUG_REG_ACCESS_REG /* segment AR */ #define SEGMENT_AR_L_MASK (1 << 13) -/* entry controls */ -#define VM_ENTRY_CONTROLS_IA32E_MASK (1 << 9) - #define AR_TYPE_ACCESSES_MASK 1 #define AR_TYPE_READABLE_MASK (1 << 1) #define AR_TYPE_WRITEABLE_MASK (1 << 2) @@ -285,13 +290,21 @@ #define AR_DPL(ar) (((ar) >> AR_DPL_SHIF #define AR_RESERVD_MASK 0xfffe0f00 -#define CR4_VMXE 0x2000 +#define MSR_IA32_VMX_BASIC 0x480 +#define MSR_IA32_VMX_PINBASED_CTLS 0x481 +#define MSR_IA32_VMX_PROCBASED_CTLS 0x482 +#define MSR_IA32_VMX_EXIT_CTLS 0x483 +#define MSR_IA32_VMX_ENTRY_CTLS 0x484 +#define MSR_IA32_VMX_MISC 0x485 +#define MSR_IA32_VMX_CR0_FIXED0 0x486 +#define MSR_IA32_VMX_CR0_FIXED1 0x487 +#define MSR_IA32_VMX_CR4_FIXED0 0x488 +#define MSR_IA32_VMX_CR4_FIXED1 0x489 +#define MSR_IA32_VMX_VMCS_ENUM 0x48a +#define MSR_IA32_VMX_PROCBASED_CTLS2 0x48b -#define MSR_IA32_VMX_BASIC 0x480 -#define MSR_IA32_FEATURE_CONTROL 0x03a -#define MSR_IA32_VMX_PINBASED_CTLS 0x481 -#define MSR_IA32_VMX_PROCBASED_CTLS 0x482 -#define MSR_IA32_VMX_EXIT_CTLS 0x483 -#define MSR_IA32_VMX_ENTRY_CTLS 0x484 +#define MSR_IA32_FEATURE_CONTROL 0x3a +#define MSR_IA32_FEATURE_CONTROL_LOCKED 0x1 +#define MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED 0x4 #endif diff --git a/drivers/kvm/x86_emulate.c b/drivers/kvm/x86_emulate.c index 4b8a0cc..585cccf 100644 --- a/drivers/kvm/x86_emulate.c +++ b/drivers/kvm/x86_emulate.c @@ -6,7 +6,7 @@ * Copyright (c) 2005 Keir Fraser * * Linux coding style, mod r/m decoder, segment base fixes, real-mode - * privieged instructions: + * privileged instructions: * * Copyright (C) 2006 Qumranet * @@ -83,7 +83,7 @@ static u8 opcode_table[256] = { /* 0x20 - 0x27 */ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, - 0, 0, 0, 0, + SrcImmByte, SrcImm, 0, 0, /* 0x28 - 0x2F */ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, @@ -99,15 +99,24 @@ static u8 opcode_table[256] = { /* 0x40 - 0x4F */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 0x50 - 0x57 */ - 0, 0, 0, 0, 0, 0, 0, 0, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, /* 0x58 - 0x5F */ ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, - /* 0x60 - 0x6F */ + /* 0x60 - 0x67 */ 0, 0, 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ , - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - /* 0x70 - 0x7F */ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, + /* 0x68 - 0x6F */ + 0, 0, ImplicitOps|Mov, 0, + SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, /* insb, insw/insd */ + SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps, /* outsb, outsw/outsd */ + /* 0x70 - 0x77 */ + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + /* 0x78 - 0x7F */ + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, /* 0x80 - 0x87 */ ByteOp | DstMem | SrcImm | ModRM, 
DstMem | SrcImm | ModRM, ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM, @@ -116,9 +125,9 @@ static u8 opcode_table[256] = { /* 0x88 - 0x8F */ ByteOp | DstMem | SrcReg | ModRM | Mov, DstMem | SrcReg | ModRM | Mov, ByteOp | DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, - 0, 0, 0, DstMem | SrcNone | ModRM | Mov, + 0, ModRM | DstReg, 0, DstMem | SrcNone | ModRM | Mov, /* 0x90 - 0x9F */ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps, ImplicitOps, 0, 0, /* 0xA0 - 0xA7 */ ByteOp | DstReg | SrcMem | Mov, DstReg | SrcMem | Mov, ByteOp | DstMem | SrcReg | Mov, DstMem | SrcReg | Mov, @@ -142,8 +151,10 @@ static u8 opcode_table[256] = { 0, 0, 0, 0, /* 0xD8 - 0xDF */ 0, 0, 0, 0, 0, 0, 0, 0, - /* 0xE0 - 0xEF */ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xE0 - 0xE7 */ + 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xE8 - 0xEF */ + ImplicitOps, SrcImm|ImplicitOps, 0, SrcImmByte|ImplicitOps, 0, 0, 0, 0, /* 0xF0 - 0xF7 */ 0, 0, 0, 0, ImplicitOps, 0, @@ -181,7 +192,10 @@ static u16 twobyte_table[256] = { /* 0x70 - 0x7F */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 0x80 - 0x8F */ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, + ImplicitOps, ImplicitOps, ImplicitOps, ImplicitOps, /* 0x90 - 0x9F */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 0xA0 - 0xA7 */ @@ -207,26 +221,6 @@ static u16 twobyte_table[256] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -/* - * Tell the emulator that of the Group 7 instructions (sgdt, lidt, etc.) we - * are interested only in invlpg and not in any of the rest. - * - * invlpg is a special instruction in that the data it references may not - * be mapped. - */ -void kvm_emulator_want_group7_invlpg(void) -{ - twobyte_table[1] &= ~SrcMem; -} -EXPORT_SYMBOL_GPL(kvm_emulator_want_group7_invlpg); - -/* Type, address-of, and value of an instruction's operand. */ -struct operand { - enum { OP_REG, OP_MEM, OP_IMM } type; - unsigned int bytes; - unsigned long val, orig_val, *ptr; -}; - /* EFLAGS bit definitions. */ #define EFLG_OF (1<<11) #define EFLG_DF (1<<10) @@ -420,7 +414,7 @@ #endif /* __i386__ */ #define insn_fetch(_type, _size, _eip) \ ({ unsigned long _x; \ rc = ops->read_std((unsigned long)(_eip) + ctxt->cs_base, &_x, \ - (_size), ctxt); \ + (_size), ctxt->vcpu); \ if ( rc != 0 ) \ goto done; \ (_eip) += (_size); \ @@ -428,23 +422,38 @@ ({ unsigned long _x; \ }) /* Access/update address held in a register, based on addressing mode. */ +#define address_mask(reg) \ + ((c->ad_bytes == sizeof(unsigned long)) ? \ + (reg) : ((reg) & ((1UL << (c->ad_bytes << 3)) - 1))) #define register_address(base, reg) \ - ((base) + ((ad_bytes == sizeof(unsigned long)) ? 
(reg) : \ - ((reg) & ((1UL << (ad_bytes << 3)) - 1)))) - + ((base) + address_mask(reg)) #define register_address_increment(reg, inc) \ do { \ /* signed type ensures sign extension to long */ \ int _inc = (inc); \ - if ( ad_bytes == sizeof(unsigned long) ) \ + if (c->ad_bytes == sizeof(unsigned long)) \ (reg) += _inc; \ else \ - (reg) = ((reg) & ~((1UL << (ad_bytes << 3)) - 1)) | \ - (((reg) + _inc) & ((1UL << (ad_bytes << 3)) - 1)); \ + (reg) = ((reg) & \ + ~((1UL << (c->ad_bytes << 3)) - 1)) | \ + (((reg) + _inc) & \ + ((1UL << (c->ad_bytes << 3)) - 1)); \ } while (0) -void *decode_register(u8 modrm_reg, unsigned long *regs, - int highbyte_regs) +#define JMP_REL(rel) \ + do { \ + c->eip += (int)(rel); \ + c->eip = ((c->op_bytes == 2) ? \ + (uint16_t)c->eip : (uint32_t)c->eip); \ + } while (0) + +/* + * Given the 'reg' portion of a ModRM byte, and a register block, return a + * pointer into the block that addresses the relevant register. + * @highbyte_regs specifies whether to decode AH,CH,DH,BH. + */ +static void *decode_register(u8 modrm_reg, unsigned long *regs, + int highbyte_regs) { void *p; @@ -464,49 +473,78 @@ static int read_descriptor(struct x86_em if (op_bytes == 2) op_bytes = 3; *address = 0; - rc = ops->read_std((unsigned long)ptr, (unsigned long *)size, 2, ctxt); + rc = ops->read_std((unsigned long)ptr, (unsigned long *)size, 2, + ctxt->vcpu); if (rc) return rc; - rc = ops->read_std((unsigned long)ptr + 2, address, op_bytes, ctxt); + rc = ops->read_std((unsigned long)ptr + 2, address, op_bytes, + ctxt->vcpu); return rc; } +static int test_cc(unsigned int condition, unsigned int flags) +{ + int rc = 0; + + switch ((condition & 15) >> 1) { + case 0: /* o */ + rc |= (flags & EFLG_OF); + break; + case 1: /* b/c/nae */ + rc |= (flags & EFLG_CF); + break; + case 2: /* z/e */ + rc |= (flags & EFLG_ZF); + break; + case 3: /* be/na */ + rc |= (flags & (EFLG_CF|EFLG_ZF)); + break; + case 4: /* s */ + rc |= (flags & EFLG_SF); + break; + case 5: /* p/pe */ + rc |= (flags & EFLG_PF); + break; + case 7: /* le/ng */ + rc |= (flags & EFLG_ZF); + /* fall through */ + case 6: /* l/nge */ + rc |= (!(flags & EFLG_SF) != !(flags & EFLG_OF)); + break; + } + + /* Odd condition identifiers (lsb == 1) have inverted sense. */ + return (!!rc ^ (condition & 1)); +} + int -x86_emulate_memop(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) +x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) { - unsigned d; - u8 b, sib, twobyte = 0, rex_prefix = 0; - u8 modrm, modrm_mod = 0, modrm_reg = 0, modrm_rm = 0; - unsigned long *override_base = NULL; - unsigned int op_bytes, ad_bytes, lock_prefix = 0, rep_prefix = 0, i; + struct decode_cache *c = &ctxt->decode; + u8 sib, rex_prefix = 0; + unsigned int i; int rc = 0; - struct operand src, dst; - unsigned long cr2 = ctxt->cr2; int mode = ctxt->mode; - unsigned long modrm_ea; - int use_modrm_ea, index_reg = 0, base_reg = 0, scale, rip_relative = 0; - int no_wb = 0; - u64 msr_data; + int index_reg = 0, base_reg = 0, scale, rip_relative = 0; /* Shadow copy of register state. Committed on successful emulation. 
*/ - unsigned long _regs[NR_VCPU_REGS]; - unsigned long _eip = ctxt->vcpu->rip, _eflags = ctxt->eflags; - unsigned long modrm_val = 0; - memcpy(_regs, ctxt->vcpu->regs, sizeof _regs); + memset(c, 0, sizeof(struct decode_cache)); + c->eip = ctxt->vcpu->rip; + memcpy(c->regs, ctxt->vcpu->regs, sizeof c->regs); switch (mode) { case X86EMUL_MODE_REAL: case X86EMUL_MODE_PROT16: - op_bytes = ad_bytes = 2; + c->op_bytes = c->ad_bytes = 2; break; case X86EMUL_MODE_PROT32: - op_bytes = ad_bytes = 4; + c->op_bytes = c->ad_bytes = 4; break; #ifdef CONFIG_X86_64 case X86EMUL_MODE_PROT64: - op_bytes = 4; - ad_bytes = 8; + c->op_bytes = 4; + c->ad_bytes = 8; break; #endif default: @@ -514,222 +552,236 @@ #endif } /* Legacy prefixes. */ - for (i = 0; i < 8; i++) { - switch (b = insn_fetch(u8, 1, _eip)) { + for (;;) { + switch (c->b = insn_fetch(u8, 1, c->eip)) { case 0x66: /* operand-size override */ - op_bytes ^= 6; /* switch between 2/4 bytes */ + c->op_bytes ^= 6; /* switch between 2/4 bytes */ break; case 0x67: /* address-size override */ if (mode == X86EMUL_MODE_PROT64) - ad_bytes ^= 12; /* switch between 4/8 bytes */ + /* switch between 4/8 bytes */ + c->ad_bytes ^= 12; else - ad_bytes ^= 6; /* switch between 2/4 bytes */ + /* switch between 2/4 bytes */ + c->ad_bytes ^= 6; break; case 0x2e: /* CS override */ - override_base = &ctxt->cs_base; + c->override_base = &ctxt->cs_base; break; case 0x3e: /* DS override */ - override_base = &ctxt->ds_base; + c->override_base = &ctxt->ds_base; break; case 0x26: /* ES override */ - override_base = &ctxt->es_base; + c->override_base = &ctxt->es_base; break; case 0x64: /* FS override */ - override_base = &ctxt->fs_base; + c->override_base = &ctxt->fs_base; break; case 0x65: /* GS override */ - override_base = &ctxt->gs_base; + c->override_base = &ctxt->gs_base; break; case 0x36: /* SS override */ - override_base = &ctxt->ss_base; + c->override_base = &ctxt->ss_base; break; + case 0x40 ... 0x4f: /* REX */ + if (mode != X86EMUL_MODE_PROT64) + goto done_prefixes; + rex_prefix = c->b; + continue; case 0xf0: /* LOCK */ - lock_prefix = 1; - break; - case 0xf3: /* REP/REPE/REPZ */ - rep_prefix = 1; + c->lock_prefix = 1; break; case 0xf2: /* REPNE/REPNZ */ + case 0xf3: /* REP/REPE/REPZ */ + c->rep_prefix = 1; break; default: goto done_prefixes; } + + /* Any legacy prefix after a REX prefix nullifies its effect. */ + + rex_prefix = 0; } done_prefixes: /* REX prefix. */ - if ((mode == X86EMUL_MODE_PROT64) && ((b & 0xf0) == 0x40)) { - rex_prefix = b; - if (b & 8) - op_bytes = 8; /* REX.W */ - modrm_reg = (b & 4) << 1; /* REX.R */ - index_reg = (b & 2) << 2; /* REX.X */ - modrm_rm = base_reg = (b & 1) << 3; /* REG.B */ - b = insn_fetch(u8, 1, _eip); + if (rex_prefix) { + if (rex_prefix & 8) + c->op_bytes = 8; /* REX.W */ + c->modrm_reg = (rex_prefix & 4) << 1; /* REX.R */ + index_reg = (rex_prefix & 2) << 2; /* REX.X */ + c->modrm_rm = base_reg = (rex_prefix & 1) << 3; /* REG.B */ } /* Opcode byte(s). */ - d = opcode_table[b]; - if (d == 0) { + c->d = opcode_table[c->b]; + if (c->d == 0) { /* Two-byte opcode? */ - if (b == 0x0f) { - twobyte = 1; - b = insn_fetch(u8, 1, _eip); - d = twobyte_table[b]; + if (c->b == 0x0f) { + c->twobyte = 1; + c->b = insn_fetch(u8, 1, c->eip); + c->d = twobyte_table[c->b]; } /* Unrecognised? */ - if (d == 0) - goto cannot_emulate; + if (c->d == 0) { + DPRINTF("Cannot emulate %02x\n", c->b); + return -1; + } } /* ModRM and SIB bytes. 
*/ - if (d & ModRM) { - modrm = insn_fetch(u8, 1, _eip); - modrm_mod |= (modrm & 0xc0) >> 6; - modrm_reg |= (modrm & 0x38) >> 3; - modrm_rm |= (modrm & 0x07); - modrm_ea = 0; - use_modrm_ea = 1; - - if (modrm_mod == 3) { - modrm_val = *(unsigned long *) - decode_register(modrm_rm, _regs, d & ByteOp); + if (c->d & ModRM) { + c->modrm = insn_fetch(u8, 1, c->eip); + c->modrm_mod |= (c->modrm & 0xc0) >> 6; + c->modrm_reg |= (c->modrm & 0x38) >> 3; + c->modrm_rm |= (c->modrm & 0x07); + c->modrm_ea = 0; + c->use_modrm_ea = 1; + + if (c->modrm_mod == 3) { + c->modrm_val = *(unsigned long *) + decode_register(c->modrm_rm, c->regs, c->d & ByteOp); goto modrm_done; } - if (ad_bytes == 2) { - unsigned bx = _regs[VCPU_REGS_RBX]; - unsigned bp = _regs[VCPU_REGS_RBP]; - unsigned si = _regs[VCPU_REGS_RSI]; - unsigned di = _regs[VCPU_REGS_RDI]; + if (c->ad_bytes == 2) { + unsigned bx = c->regs[VCPU_REGS_RBX]; + unsigned bp = c->regs[VCPU_REGS_RBP]; + unsigned si = c->regs[VCPU_REGS_RSI]; + unsigned di = c->regs[VCPU_REGS_RDI]; /* 16-bit ModR/M decode. */ - switch (modrm_mod) { + switch (c->modrm_mod) { case 0: - if (modrm_rm == 6) - modrm_ea += insn_fetch(u16, 2, _eip); + if (c->modrm_rm == 6) + c->modrm_ea += + insn_fetch(u16, 2, c->eip); break; case 1: - modrm_ea += insn_fetch(s8, 1, _eip); + c->modrm_ea += insn_fetch(s8, 1, c->eip); break; case 2: - modrm_ea += insn_fetch(u16, 2, _eip); + c->modrm_ea += insn_fetch(u16, 2, c->eip); break; } - switch (modrm_rm) { + switch (c->modrm_rm) { case 0: - modrm_ea += bx + si; + c->modrm_ea += bx + si; break; case 1: - modrm_ea += bx + di; + c->modrm_ea += bx + di; break; case 2: - modrm_ea += bp + si; + c->modrm_ea += bp + si; break; case 3: - modrm_ea += bp + di; + c->modrm_ea += bp + di; break; case 4: - modrm_ea += si; + c->modrm_ea += si; break; case 5: - modrm_ea += di; + c->modrm_ea += di; break; case 6: - if (modrm_mod != 0) - modrm_ea += bp; + if (c->modrm_mod != 0) + c->modrm_ea += bp; break; case 7: - modrm_ea += bx; + c->modrm_ea += bx; break; } - if (modrm_rm == 2 || modrm_rm == 3 || - (modrm_rm == 6 && modrm_mod != 0)) - if (!override_base) - override_base = &ctxt->ss_base; - modrm_ea = (u16)modrm_ea; + if (c->modrm_rm == 2 || c->modrm_rm == 3 || + (c->modrm_rm == 6 && c->modrm_mod != 0)) + if (!c->override_base) + c->override_base = &ctxt->ss_base; + c->modrm_ea = (u16)c->modrm_ea; } else { /* 32/64-bit ModR/M decode. 
*/ - switch (modrm_rm) { + switch (c->modrm_rm) { case 4: case 12: - sib = insn_fetch(u8, 1, _eip); + sib = insn_fetch(u8, 1, c->eip); index_reg |= (sib >> 3) & 7; base_reg |= sib & 7; scale = sib >> 6; switch (base_reg) { case 5: - if (modrm_mod != 0) - modrm_ea += _regs[base_reg]; + if (c->modrm_mod != 0) + c->modrm_ea += + c->regs[base_reg]; else - modrm_ea += insn_fetch(s32, 4, _eip); + c->modrm_ea += + insn_fetch(s32, 4, c->eip); break; default: - modrm_ea += _regs[base_reg]; + c->modrm_ea += c->regs[base_reg]; } switch (index_reg) { case 4: break; default: - modrm_ea += _regs[index_reg] << scale; + c->modrm_ea += + c->regs[index_reg] << scale; } break; case 5: - if (modrm_mod != 0) - modrm_ea += _regs[modrm_rm]; + if (c->modrm_mod != 0) + c->modrm_ea += c->regs[c->modrm_rm]; else if (mode == X86EMUL_MODE_PROT64) rip_relative = 1; break; default: - modrm_ea += _regs[modrm_rm]; + c->modrm_ea += c->regs[c->modrm_rm]; break; } - switch (modrm_mod) { + switch (c->modrm_mod) { case 0: - if (modrm_rm == 5) - modrm_ea += insn_fetch(s32, 4, _eip); + if (c->modrm_rm == 5) + c->modrm_ea += + insn_fetch(s32, 4, c->eip); break; case 1: - modrm_ea += insn_fetch(s8, 1, _eip); + c->modrm_ea += insn_fetch(s8, 1, c->eip); break; case 2: - modrm_ea += insn_fetch(s32, 4, _eip); + c->modrm_ea += insn_fetch(s32, 4, c->eip); break; } } - if (!override_base) - override_base = &ctxt->ds_base; + if (!c->override_base) + c->override_base = &ctxt->ds_base; if (mode == X86EMUL_MODE_PROT64 && - override_base != &ctxt->fs_base && - override_base != &ctxt->gs_base) - override_base = NULL; + c->override_base != &ctxt->fs_base && + c->override_base != &ctxt->gs_base) + c->override_base = NULL; - if (override_base) - modrm_ea += *override_base; + if (c->override_base) + c->modrm_ea += *c->override_base; if (rip_relative) { - modrm_ea += _eip; - switch (d & SrcMask) { + c->modrm_ea += c->eip; + switch (c->d & SrcMask) { case SrcImmByte: - modrm_ea += 1; + c->modrm_ea += 1; break; case SrcImm: - if (d & ByteOp) - modrm_ea += 1; + if (c->d & ByteOp) + c->modrm_ea += 1; else - if (op_bytes == 8) - modrm_ea += 4; + if (c->op_bytes == 8) + c->modrm_ea += 4; else - modrm_ea += op_bytes; + c->modrm_ea += c->op_bytes; } } - if (ad_bytes != 8) - modrm_ea = (u32)modrm_ea; - cr2 = modrm_ea; + if (c->ad_bytes != 8) + c->modrm_ea = (u32)c->modrm_ea; modrm_done: ; } @@ -738,162 +790,469 @@ done_prefixes: * Decode and fetch the source operand: register, memory * or immediate. 
*/ - switch (d & SrcMask) { + switch (c->d & SrcMask) { case SrcNone: break; case SrcReg: - src.type = OP_REG; - if (d & ByteOp) { - src.ptr = decode_register(modrm_reg, _regs, + c->src.type = OP_REG; + if (c->d & ByteOp) { + c->src.ptr = + decode_register(c->modrm_reg, c->regs, (rex_prefix == 0)); - src.val = src.orig_val = *(u8 *) src.ptr; - src.bytes = 1; + c->src.val = c->src.orig_val = *(u8 *)c->src.ptr; + c->src.bytes = 1; } else { - src.ptr = decode_register(modrm_reg, _regs, 0); - switch ((src.bytes = op_bytes)) { + c->src.ptr = + decode_register(c->modrm_reg, c->regs, 0); + switch ((c->src.bytes = c->op_bytes)) { case 2: - src.val = src.orig_val = *(u16 *) src.ptr; + c->src.val = c->src.orig_val = + *(u16 *) c->src.ptr; break; case 4: - src.val = src.orig_val = *(u32 *) src.ptr; + c->src.val = c->src.orig_val = + *(u32 *) c->src.ptr; break; case 8: - src.val = src.orig_val = *(u64 *) src.ptr; + c->src.val = c->src.orig_val = + *(u64 *) c->src.ptr; break; } } break; case SrcMem16: - src.bytes = 2; + c->src.bytes = 2; goto srcmem_common; case SrcMem32: - src.bytes = 4; + c->src.bytes = 4; goto srcmem_common; case SrcMem: - src.bytes = (d & ByteOp) ? 1 : op_bytes; + c->src.bytes = (c->d & ByteOp) ? 1 : + c->op_bytes; + /* Don't fetch the address for invlpg: it could be unmapped. */ + if (c->twobyte && c->b == 0x01 + && c->modrm_reg == 7) + break; srcmem_common: - src.type = OP_MEM; - src.ptr = (unsigned long *)cr2; - if ((rc = ops->read_emulated((unsigned long)src.ptr, - &src.val, src.bytes, ctxt)) != 0) - goto done; - src.orig_val = src.val; + c->src.type = OP_MEM; break; case SrcImm: - src.type = OP_IMM; - src.ptr = (unsigned long *)_eip; - src.bytes = (d & ByteOp) ? 1 : op_bytes; - if (src.bytes == 8) - src.bytes = 4; + c->src.type = OP_IMM; + c->src.ptr = (unsigned long *)c->eip; + c->src.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; + if (c->src.bytes == 8) + c->src.bytes = 4; /* NB. Immediates are sign-extended as necessary. */ - switch (src.bytes) { + switch (c->src.bytes) { case 1: - src.val = insn_fetch(s8, 1, _eip); + c->src.val = insn_fetch(s8, 1, c->eip); break; case 2: - src.val = insn_fetch(s16, 2, _eip); + c->src.val = insn_fetch(s16, 2, c->eip); break; case 4: - src.val = insn_fetch(s32, 4, _eip); + c->src.val = insn_fetch(s32, 4, c->eip); break; } break; case SrcImmByte: - src.type = OP_IMM; - src.ptr = (unsigned long *)_eip; - src.bytes = 1; - src.val = insn_fetch(s8, 1, _eip); + c->src.type = OP_IMM; + c->src.ptr = (unsigned long *)c->eip; + c->src.bytes = 1; + c->src.val = insn_fetch(s8, 1, c->eip); break; } /* Decode and fetch the destination operand: register or memory. */ - switch (d & DstMask) { + switch (c->d & DstMask) { case ImplicitOps: /* Special instructions do their own operand decoding. 
*/ - goto special_insn; + return 0; case DstReg: - dst.type = OP_REG; - if ((d & ByteOp) - && !(twobyte_table && (b == 0xb6 || b == 0xb7))) { - dst.ptr = decode_register(modrm_reg, _regs, + c->dst.type = OP_REG; + if ((c->d & ByteOp) + && !(c->twobyte && + (c->b == 0xb6 || c->b == 0xb7))) { + c->dst.ptr = + decode_register(c->modrm_reg, c->regs, (rex_prefix == 0)); - dst.val = *(u8 *) dst.ptr; - dst.bytes = 1; + c->dst.val = *(u8 *) c->dst.ptr; + c->dst.bytes = 1; } else { - dst.ptr = decode_register(modrm_reg, _regs, 0); - switch ((dst.bytes = op_bytes)) { + c->dst.ptr = + decode_register(c->modrm_reg, c->regs, 0); + switch ((c->dst.bytes = c->op_bytes)) { case 2: - dst.val = *(u16 *)dst.ptr; + c->dst.val = *(u16 *)c->dst.ptr; break; case 4: - dst.val = *(u32 *)dst.ptr; + c->dst.val = *(u32 *)c->dst.ptr; break; case 8: - dst.val = *(u64 *)dst.ptr; + c->dst.val = *(u64 *)c->dst.ptr; break; } } break; case DstMem: - dst.type = OP_MEM; - dst.ptr = (unsigned long *)cr2; - dst.bytes = (d & ByteOp) ? 1 : op_bytes; - if (d & BitOp) { - unsigned long mask = ~(dst.bytes * 8 - 1); + c->dst.type = OP_MEM; + break; + } + +done: + return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; +} + +static inline void emulate_push(struct x86_emulate_ctxt *ctxt) +{ + struct decode_cache *c = &ctxt->decode; + + c->dst.type = OP_MEM; + c->dst.bytes = c->op_bytes; + c->dst.val = c->src.val; + register_address_increment(c->regs[VCPU_REGS_RSP], -c->op_bytes); + c->dst.ptr = (void *) register_address(ctxt->ss_base, + c->regs[VCPU_REGS_RSP]); +} - dst.ptr = (void *)dst.ptr + (src.val & mask) / 8; +static inline int emulate_grp1a(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) +{ + struct decode_cache *c = &ctxt->decode; + int rc; + + /* 64-bit mode: POP always pops a 64-bit operand. */ + + if (ctxt->mode == X86EMUL_MODE_PROT64) + c->dst.bytes = 8; + + rc = ops->read_std(register_address(ctxt->ss_base, + c->regs[VCPU_REGS_RSP]), + &c->dst.val, c->dst.bytes, ctxt->vcpu); + if (rc != 0) + return rc; + + register_address_increment(c->regs[VCPU_REGS_RSP], c->dst.bytes); + + return 0; +} + +static inline void emulate_grp2(struct x86_emulate_ctxt *ctxt) +{ + struct decode_cache *c = &ctxt->decode; + switch (c->modrm_reg) { + case 0: /* rol */ + emulate_2op_SrcB("rol", c->src, c->dst, ctxt->eflags); + break; + case 1: /* ror */ + emulate_2op_SrcB("ror", c->src, c->dst, ctxt->eflags); + break; + case 2: /* rcl */ + emulate_2op_SrcB("rcl", c->src, c->dst, ctxt->eflags); + break; + case 3: /* rcr */ + emulate_2op_SrcB("rcr", c->src, c->dst, ctxt->eflags); + break; + case 4: /* sal/shl */ + case 6: /* sal/shl */ + emulate_2op_SrcB("sal", c->src, c->dst, ctxt->eflags); + break; + case 5: /* shr */ + emulate_2op_SrcB("shr", c->src, c->dst, ctxt->eflags); + break; + case 7: /* sar */ + emulate_2op_SrcB("sar", c->src, c->dst, ctxt->eflags); + break; + } +} + +static inline int emulate_grp3(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) +{ + struct decode_cache *c = &ctxt->decode; + int rc = 0; + + switch (c->modrm_reg) { + case 0 ... 1: /* test */ + /* + * Special case in Grp3: test has an immediate + * source operand. + */ + c->src.type = OP_IMM; + c->src.ptr = (unsigned long *)c->eip; + c->src.bytes = (c->d & ByteOp) ? 
1 : c->op_bytes; + if (c->src.bytes == 8) + c->src.bytes = 4; + switch (c->src.bytes) { + case 1: + c->src.val = insn_fetch(s8, 1, c->eip); + break; + case 2: + c->src.val = insn_fetch(s16, 2, c->eip); + break; + case 4: + c->src.val = insn_fetch(s32, 4, c->eip); + break; } - if (!(d & Mov) && /* optimisation - avoid slow emulated read */ - ((rc = ops->read_emulated((unsigned long)dst.ptr, - &dst.val, dst.bytes, ctxt)) != 0)) - goto done; + emulate_2op_SrcV("test", c->src, c->dst, ctxt->eflags); + break; + case 2: /* not */ + c->dst.val = ~c->dst.val; + break; + case 3: /* neg */ + emulate_1op("neg", c->dst, ctxt->eflags); + break; + default: + DPRINTF("Cannot emulate %02x\n", c->b); + rc = X86EMUL_UNHANDLEABLE; + break; + } +done: + return rc; +} + +static inline int emulate_grp45(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) +{ + struct decode_cache *c = &ctxt->decode; + int rc; + + switch (c->modrm_reg) { + case 0: /* inc */ + emulate_1op("inc", c->dst, ctxt->eflags); + break; + case 1: /* dec */ + emulate_1op("dec", c->dst, ctxt->eflags); + break; + case 4: /* jmp abs */ + if (c->b == 0xff) + c->eip = c->dst.val; + else { + DPRINTF("Cannot emulate %02x\n", c->b); + return X86EMUL_UNHANDLEABLE; + } + break; + case 6: /* push */ + + /* 64-bit mode: PUSH always pushes a 64-bit operand. */ + + if (ctxt->mode == X86EMUL_MODE_PROT64) { + c->dst.bytes = 8; + rc = ops->read_std((unsigned long)c->dst.ptr, + &c->dst.val, 8, ctxt->vcpu); + if (rc != 0) + return rc; + } + register_address_increment(c->regs[VCPU_REGS_RSP], + -c->dst.bytes); + rc = ops->write_std(register_address(ctxt->ss_base, + c->regs[VCPU_REGS_RSP]), &c->dst.val, + c->dst.bytes, ctxt->vcpu); + if (rc != 0) + return rc; + c->dst.type = OP_NONE; + break; + default: + DPRINTF("Cannot emulate %02x\n", c->b); + return X86EMUL_UNHANDLEABLE; + } + return 0; +} + +static inline int emulate_grp9(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + unsigned long cr2) +{ + struct decode_cache *c = &ctxt->decode; + u64 old, new; + int rc; + + rc = ops->read_emulated(cr2, &old, 8, ctxt->vcpu); + if (rc != 0) + return rc; + + if (((u32) (old >> 0) != (u32) c->regs[VCPU_REGS_RAX]) || + ((u32) (old >> 32) != (u32) c->regs[VCPU_REGS_RDX])) { + + c->regs[VCPU_REGS_RAX] = (u32) (old >> 0); + c->regs[VCPU_REGS_RDX] = (u32) (old >> 32); + ctxt->eflags &= ~EFLG_ZF; + + } else { + new = ((u64)c->regs[VCPU_REGS_RCX] << 32) | + (u32) c->regs[VCPU_REGS_RBX]; + + rc = ops->cmpxchg_emulated(cr2, &old, &new, 8, ctxt->vcpu); + if (rc != 0) + return rc; + ctxt->eflags |= EFLG_ZF; + } + return 0; +} + +static inline int writeback(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops) +{ + int rc; + struct decode_cache *c = &ctxt->decode; + + switch (c->dst.type) { + case OP_REG: + /* The 4-byte case *is* correct: + * in 64-bit mode we zero-extend. 
+ */ + switch (c->dst.bytes) { + case 1: + *(u8 *)c->dst.ptr = (u8)c->dst.val; + break; + case 2: + *(u16 *)c->dst.ptr = (u16)c->dst.val; + break; + case 4: + *c->dst.ptr = (u32)c->dst.val; + break; /* 64b: zero-ext */ + case 8: + *c->dst.ptr = c->dst.val; + break; + } + break; + case OP_MEM: + if (c->lock_prefix) + rc = ops->cmpxchg_emulated( + (unsigned long)c->dst.ptr, + &c->dst.orig_val, + &c->dst.val, + c->dst.bytes, + ctxt->vcpu); + else + rc = ops->write_emulated( + (unsigned long)c->dst.ptr, + &c->dst.val, + c->dst.bytes, + ctxt->vcpu); + if (rc != 0) + return rc; + break; + case OP_NONE: + /* no writeback */ break; + default: + break; + } + return 0; +} + +int +x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) +{ + unsigned long cr2 = ctxt->cr2; + u64 msr_data; + unsigned long saved_rcx = 0, saved_eip = 0; + struct decode_cache *c = &ctxt->decode; + int rc = 0; + + if ((c->d & ModRM) && (c->modrm_mod != 3)) + cr2 = c->modrm_ea; + + if (c->src.type == OP_MEM) { + c->src.ptr = (unsigned long *)cr2; + c->src.val = 0; + if ((rc = ops->read_emulated((unsigned long)c->src.ptr, + &c->src.val, + c->src.bytes, + ctxt->vcpu)) != 0) + goto done; + c->src.orig_val = c->src.val; } - dst.orig_val = dst.val; - if (twobyte) + if ((c->d & DstMask) == ImplicitOps) + goto special_insn; + + + if (c->dst.type == OP_MEM) { + c->dst.ptr = (unsigned long *)cr2; + c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; + c->dst.val = 0; + if (c->d & BitOp) { + unsigned long mask = ~(c->dst.bytes * 8 - 1); + + c->dst.ptr = (void *)c->dst.ptr + + (c->src.val & mask) / 8; + } + if (!(c->d & Mov) && + /* optimisation - avoid slow emulated read */ + ((rc = ops->read_emulated((unsigned long)c->dst.ptr, + &c->dst.val, + c->dst.bytes, ctxt->vcpu)) != 0)) + goto done; + } + c->dst.orig_val = c->dst.val; + + if (c->twobyte) goto twobyte_insn; - switch (b) { + switch (c->b) { case 0x00 ... 0x05: add: /* add */ - emulate_2op_SrcV("add", src, dst, _eflags); + emulate_2op_SrcV("add", c->src, c->dst, ctxt->eflags); break; case 0x08 ... 0x0d: or: /* or */ - emulate_2op_SrcV("or", src, dst, _eflags); + emulate_2op_SrcV("or", c->src, c->dst, ctxt->eflags); break; case 0x10 ... 0x15: adc: /* adc */ - emulate_2op_SrcV("adc", src, dst, _eflags); + emulate_2op_SrcV("adc", c->src, c->dst, ctxt->eflags); break; case 0x18 ... 0x1d: sbb: /* sbb */ - emulate_2op_SrcV("sbb", src, dst, _eflags); + emulate_2op_SrcV("sbb", c->src, c->dst, ctxt->eflags); break; - case 0x20 ... 0x25: + case 0x20 ... 0x23: and: /* and */ - emulate_2op_SrcV("and", src, dst, _eflags); + emulate_2op_SrcV("and", c->src, c->dst, ctxt->eflags); break; + case 0x24: /* and al imm8 */ + c->dst.type = OP_REG; + c->dst.ptr = &c->regs[VCPU_REGS_RAX]; + c->dst.val = *(u8 *)c->dst.ptr; + c->dst.bytes = 1; + c->dst.orig_val = c->dst.val; + goto and; + case 0x25: /* and ax imm16, or eax imm32 */ + c->dst.type = OP_REG; + c->dst.bytes = c->op_bytes; + c->dst.ptr = &c->regs[VCPU_REGS_RAX]; + if (c->op_bytes == 2) + c->dst.val = *(u16 *)c->dst.ptr; + else + c->dst.val = *(u32 *)c->dst.ptr; + c->dst.orig_val = c->dst.val; + goto and; case 0x28 ... 0x2d: sub: /* sub */ - emulate_2op_SrcV("sub", src, dst, _eflags); + emulate_2op_SrcV("sub", c->src, c->dst, ctxt->eflags); break; case 0x30 ... 0x35: xor: /* xor */ - emulate_2op_SrcV("xor", src, dst, _eflags); + emulate_2op_SrcV("xor", c->src, c->dst, ctxt->eflags); break; case 0x38 ... 
0x3d: cmp: /* cmp */ - emulate_2op_SrcV("cmp", src, dst, _eflags); + emulate_2op_SrcV("cmp", c->src, c->dst, ctxt->eflags); break; case 0x63: /* movsxd */ - if (mode != X86EMUL_MODE_PROT64) + if (ctxt->mode != X86EMUL_MODE_PROT64) goto cannot_emulate; - dst.val = (s32) src.val; + c->dst.val = (s32) c->src.val; + break; + case 0x6a: /* push imm8 */ + c->src.val = 0L; + c->src.val = insn_fetch(s8, 1, c->eip); + emulate_push(ctxt); break; case 0x80 ... 0x83: /* Grp1 */ - switch (modrm_reg) { + switch (c->modrm_reg) { case 0: goto add; case 1: @@ -913,301 +1272,325 @@ done_prefixes: } break; case 0x84 ... 0x85: - test: /* test */ - emulate_2op_SrcV("test", src, dst, _eflags); + emulate_2op_SrcV("test", c->src, c->dst, ctxt->eflags); break; case 0x86 ... 0x87: /* xchg */ /* Write back the register source. */ - switch (dst.bytes) { + switch (c->dst.bytes) { case 1: - *(u8 *) src.ptr = (u8) dst.val; + *(u8 *) c->src.ptr = (u8) c->dst.val; break; case 2: - *(u16 *) src.ptr = (u16) dst.val; + *(u16 *) c->src.ptr = (u16) c->dst.val; break; case 4: - *src.ptr = (u32) dst.val; + *c->src.ptr = (u32) c->dst.val; break; /* 64b reg: zero-extend */ case 8: - *src.ptr = dst.val; + *c->src.ptr = c->dst.val; break; } /* * Write back the memory destination with implicit LOCK * prefix. */ - dst.val = src.val; - lock_prefix = 1; - break; - case 0xa0 ... 0xa1: /* mov */ - dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX]; - dst.val = src.val; - _eip += ad_bytes; /* skip src displacement */ - break; - case 0xa2 ... 0xa3: /* mov */ - dst.val = (unsigned long)_regs[VCPU_REGS_RAX]; - _eip += ad_bytes; /* skip dst displacement */ + c->dst.val = c->src.val; + c->lock_prefix = 1; break; case 0x88 ... 0x8b: /* mov */ - case 0xc6 ... 0xc7: /* mov (sole member of Grp11) */ - dst.val = src.val; + goto mov; + case 0x8d: /* lea r16/r32, m */ + c->dst.val = c->modrm_val; break; case 0x8f: /* pop (sole member of Grp1a) */ - /* 64-bit mode: POP always pops a 64-bit operand. */ - if (mode == X86EMUL_MODE_PROT64) - dst.bytes = 8; - if ((rc = ops->read_std(register_address(ctxt->ss_base, - _regs[VCPU_REGS_RSP]), - &dst.val, dst.bytes, ctxt)) != 0) + rc = emulate_grp1a(ctxt, ops); + if (rc != 0) goto done; - register_address_increment(_regs[VCPU_REGS_RSP], dst.bytes); + break; + case 0xa0 ... 0xa1: /* mov */ + c->dst.ptr = (unsigned long *)&c->regs[VCPU_REGS_RAX]; + c->dst.val = c->src.val; + /* skip src displacement */ + c->eip += c->ad_bytes; + break; + case 0xa2 ... 0xa3: /* mov */ + c->dst.val = (unsigned long)c->regs[VCPU_REGS_RAX]; + /* skip c->dst displacement */ + c->eip += c->ad_bytes; break; case 0xc0 ... 0xc1: - grp2: /* Grp2 */ - switch (modrm_reg) { - case 0: /* rol */ - emulate_2op_SrcB("rol", src, dst, _eflags); - break; - case 1: /* ror */ - emulate_2op_SrcB("ror", src, dst, _eflags); - break; - case 2: /* rcl */ - emulate_2op_SrcB("rcl", src, dst, _eflags); - break; - case 3: /* rcr */ - emulate_2op_SrcB("rcr", src, dst, _eflags); - break; - case 4: /* sal/shl */ - case 6: /* sal/shl */ - emulate_2op_SrcB("sal", src, dst, _eflags); - break; - case 5: /* shr */ - emulate_2op_SrcB("shr", src, dst, _eflags); - break; - case 7: /* sar */ - emulate_2op_SrcB("sar", src, dst, _eflags); - break; - } + emulate_grp2(ctxt); + break; + case 0xc6 ... 0xc7: /* mov (sole member of Grp11) */ + mov: + c->dst.val = c->src.val; break; case 0xd0 ... 0xd1: /* Grp2 */ - src.val = 1; - goto grp2; + c->src.val = 1; + emulate_grp2(ctxt); + break; case 0xd2 ... 
0xd3: /* Grp2 */ - src.val = _regs[VCPU_REGS_RCX]; - goto grp2; + c->src.val = c->regs[VCPU_REGS_RCX]; + emulate_grp2(ctxt); + break; case 0xf6 ... 0xf7: /* Grp3 */ - switch (modrm_reg) { - case 0 ... 1: /* test */ - /* - * Special case in Grp3: test has an immediate - * source operand. - */ - src.type = OP_IMM; - src.ptr = (unsigned long *)_eip; - src.bytes = (d & ByteOp) ? 1 : op_bytes; - if (src.bytes == 8) - src.bytes = 4; - switch (src.bytes) { - case 1: - src.val = insn_fetch(s8, 1, _eip); - break; - case 2: - src.val = insn_fetch(s16, 2, _eip); - break; - case 4: - src.val = insn_fetch(s32, 4, _eip); - break; - } - goto test; - case 2: /* not */ - dst.val = ~dst.val; - break; - case 3: /* neg */ - emulate_1op("neg", dst, _eflags); - break; - default: - goto cannot_emulate; - } + rc = emulate_grp3(ctxt, ops); + if (rc != 0) + goto done; break; case 0xfe ... 0xff: /* Grp4/Grp5 */ - switch (modrm_reg) { - case 0: /* inc */ - emulate_1op("inc", dst, _eflags); - break; - case 1: /* dec */ - emulate_1op("dec", dst, _eflags); - break; - case 6: /* push */ - /* 64-bit mode: PUSH always pushes a 64-bit operand. */ - if (mode == X86EMUL_MODE_PROT64) { - dst.bytes = 8; - if ((rc = ops->read_std((unsigned long)dst.ptr, - &dst.val, 8, - ctxt)) != 0) - goto done; - } - register_address_increment(_regs[VCPU_REGS_RSP], - -dst.bytes); - if ((rc = ops->write_std( - register_address(ctxt->ss_base, - _regs[VCPU_REGS_RSP]), - &dst.val, dst.bytes, ctxt)) != 0) - goto done; - no_wb = 1; - break; - default: - goto cannot_emulate; - } + rc = emulate_grp45(ctxt, ops); + if (rc != 0) + goto done; break; } writeback: - if (!no_wb) { - switch (dst.type) { - case OP_REG: - /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */ - switch (dst.bytes) { - case 1: - *(u8 *)dst.ptr = (u8)dst.val; - break; - case 2: - *(u16 *)dst.ptr = (u16)dst.val; - break; - case 4: - *dst.ptr = (u32)dst.val; - break; /* 64b: zero-ext */ - case 8: - *dst.ptr = dst.val; - break; - } - break; - case OP_MEM: - if (lock_prefix) - rc = ops->cmpxchg_emulated((unsigned long)dst. - ptr, &dst.orig_val, - &dst.val, dst.bytes, - ctxt); - else - rc = ops->write_emulated((unsigned long)dst.ptr, - &dst.val, dst.bytes, - ctxt); - if (rc != 0) - goto done; - default: - break; - } - } + rc = writeback(ctxt, ops); + if (rc != 0) + goto done; /* Commit shadow register state. */ - memcpy(ctxt->vcpu->regs, _regs, sizeof _regs); - ctxt->eflags = _eflags; - ctxt->vcpu->rip = _eip; + memcpy(ctxt->vcpu->regs, c->regs, sizeof c->regs); + ctxt->vcpu->rip = c->eip; done: return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; special_insn: - if (twobyte) + if (c->twobyte) goto twobyte_special_insn; - if (rep_prefix) { - if (_regs[VCPU_REGS_RCX] == 0) { - ctxt->vcpu->rip = _eip; + switch (c->b) { + case 0x50 ... 0x57: /* push reg */ + if (c->op_bytes == 2) + c->src.val = (u16) c->regs[c->b & 0x7]; + else + c->src.val = (u32) c->regs[c->b & 0x7]; + c->dst.type = OP_MEM; + c->dst.bytes = c->op_bytes; + c->dst.val = c->src.val; + register_address_increment(c->regs[VCPU_REGS_RSP], + -c->op_bytes); + c->dst.ptr = (void *) register_address( + ctxt->ss_base, c->regs[VCPU_REGS_RSP]); + break; + case 0x58 ... 
0x5f: /* pop reg */ + c->dst.ptr = (unsigned long *)&c->regs[c->b & 0x7]; + pop_instruction: + if ((rc = ops->read_std(register_address(ctxt->ss_base, + c->regs[VCPU_REGS_RSP]), c->dst.ptr, + c->op_bytes, ctxt->vcpu)) != 0) { + if (c->rep_prefix) { + c->regs[VCPU_REGS_RCX] = saved_rcx; + c->eip = saved_eip; + } + goto done; + } + + register_address_increment(c->regs[VCPU_REGS_RSP], + c->op_bytes); + c->dst.type = OP_NONE; /* Disable writeback. */ + break; + case 0x6c: /* insb */ + case 0x6d: /* insw/insd */ + if (kvm_emulate_pio_string(ctxt->vcpu, NULL, + 1, + (c->d & ByteOp) ? 1 : c->op_bytes, + c->rep_prefix ? + address_mask(c->regs[VCPU_REGS_RCX]) : 1, + (ctxt->eflags & EFLG_DF), + register_address(ctxt->es_base, + c->regs[VCPU_REGS_RDI]), + c->rep_prefix, + c->regs[VCPU_REGS_RDX]) == 0) + return -1; + return 0; + case 0x6e: /* outsb */ + case 0x6f: /* outsw/outsd */ + if (kvm_emulate_pio_string(ctxt->vcpu, NULL, + 0, + (c->d & ByteOp) ? 1 : c->op_bytes, + c->rep_prefix ? + address_mask(c->regs[VCPU_REGS_RCX]) : 1, + (ctxt->eflags & EFLG_DF), + register_address(c->override_base ? + *c->override_base : + ctxt->ds_base, + c->regs[VCPU_REGS_RSI]), + c->rep_prefix, + c->regs[VCPU_REGS_RDX]) == 0) + return -1; + return 0; + case 0x70 ... 0x7f: /* jcc (short) */ { + int rel = insn_fetch(s8, 1, c->eip); + + if (test_cc(c->b, ctxt->eflags)) + JMP_REL(rel); + break; + } + case 0x9c: /* pushf */ + c->src.val = (unsigned long) ctxt->eflags; + emulate_push(ctxt); + break; + case 0x9d: /* popf */ + c->dst.ptr = (unsigned long *) &ctxt->eflags; + goto pop_instruction; + case 0xc3: /* ret */ + c->dst.ptr = &c->eip; + goto pop_instruction; + case 0xf4: /* hlt */ + ctxt->vcpu->halt_request = 1; + goto done; + } + if (c->rep_prefix) { + if (c->regs[VCPU_REGS_RCX] == 0) { + ctxt->vcpu->rip = c->eip; goto done; } - _regs[VCPU_REGS_RCX]--; - _eip = ctxt->vcpu->rip; + saved_rcx = c->regs[VCPU_REGS_RCX]; + saved_eip = c->eip; + c->regs[VCPU_REGS_RCX]--; + c->eip = ctxt->vcpu->rip; } - switch (b) { + switch (c->b) { case 0xa4 ... 0xa5: /* movs */ - dst.type = OP_MEM; - dst.bytes = (d & ByteOp) ? 1 : op_bytes; - dst.ptr = (unsigned long *)register_address(ctxt->es_base, - _regs[VCPU_REGS_RDI]); + c->dst.type = OP_MEM; + c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; + c->dst.ptr = (unsigned long *)register_address( + ctxt->es_base, + c->regs[VCPU_REGS_RDI]); if ((rc = ops->read_emulated(register_address( - override_base ? *override_base : ctxt->ds_base, - _regs[VCPU_REGS_RSI]), &dst.val, dst.bytes, ctxt)) != 0) + c->override_base ? *c->override_base : + ctxt->ds_base, + c->regs[VCPU_REGS_RSI]), + &c->dst.val, + c->dst.bytes, ctxt->vcpu)) != 0) { + if (c->rep_prefix) { + c->regs[VCPU_REGS_RCX] = saved_rcx; + c->eip = saved_eip; + } goto done; - register_address_increment(_regs[VCPU_REGS_RSI], - (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); - register_address_increment(_regs[VCPU_REGS_RDI], - (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); + } + register_address_increment(c->regs[VCPU_REGS_RSI], + (ctxt->eflags & EFLG_DF) ? -c->dst.bytes + : c->dst.bytes); + register_address_increment(c->regs[VCPU_REGS_RDI], + (ctxt->eflags & EFLG_DF) ? -c->dst.bytes + : c->dst.bytes); break; case 0xa6 ... 0xa7: /* cmps */ DPRINTF("Urk! I don't handle CMPS.\n"); goto cannot_emulate; case 0xaa ... 0xab: /* stos */ - dst.type = OP_MEM; - dst.bytes = (d & ByteOp) ? 1 : op_bytes; - dst.ptr = (unsigned long *)cr2; - dst.val = _regs[VCPU_REGS_RAX]; - register_address_increment(_regs[VCPU_REGS_RDI], - (_eflags & EFLG_DF) ? 
-dst.bytes : dst.bytes); + c->dst.type = OP_MEM; + c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; + c->dst.ptr = (unsigned long *)cr2; + c->dst.val = c->regs[VCPU_REGS_RAX]; + register_address_increment(c->regs[VCPU_REGS_RDI], + (ctxt->eflags & EFLG_DF) ? -c->dst.bytes + : c->dst.bytes); break; case 0xac ... 0xad: /* lods */ - dst.type = OP_REG; - dst.bytes = (d & ByteOp) ? 1 : op_bytes; - dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX]; - if ((rc = ops->read_emulated(cr2, &dst.val, dst.bytes, ctxt)) != 0) + c->dst.type = OP_REG; + c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes; + c->dst.ptr = (unsigned long *)&c->regs[VCPU_REGS_RAX]; + if ((rc = ops->read_emulated(cr2, &c->dst.val, + c->dst.bytes, + ctxt->vcpu)) != 0) { + if (c->rep_prefix) { + c->regs[VCPU_REGS_RCX] = saved_rcx; + c->eip = saved_eip; + } goto done; - register_address_increment(_regs[VCPU_REGS_RSI], - (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); + } + register_address_increment(c->regs[VCPU_REGS_RSI], + (ctxt->eflags & EFLG_DF) ? -c->dst.bytes + : c->dst.bytes); break; case 0xae ... 0xaf: /* scas */ DPRINTF("Urk! I don't handle SCAS.\n"); goto cannot_emulate; - case 0xf4: /* hlt */ - ctxt->vcpu->halt_request = 1; - goto done; - case 0xc3: /* ret */ - dst.ptr = &_eip; - goto pop_instruction; - case 0x58 ... 0x5f: /* pop reg */ - dst.ptr = (unsigned long *)&_regs[b & 0x7]; + case 0xe8: /* call (near) */ { + long int rel; + switch (c->op_bytes) { + case 2: + rel = insn_fetch(s16, 2, c->eip); + break; + case 4: + rel = insn_fetch(s32, 4, c->eip); + break; + case 8: + rel = insn_fetch(s64, 8, c->eip); + break; + default: + DPRINTF("Call: Invalid op_bytes\n"); + goto cannot_emulate; + } + c->src.val = (unsigned long) c->eip; + JMP_REL(rel); + emulate_push(ctxt); + break; + } + case 0xe9: /* jmp rel */ + case 0xeb: /* jmp rel short */ + JMP_REL(c->src.val); + c->dst.type = OP_NONE; /* Disable writeback. */ + break; -pop_instruction: - if ((rc = ops->read_std(register_address(ctxt->ss_base, - _regs[VCPU_REGS_RSP]), dst.ptr, op_bytes, ctxt)) != 0) - goto done; - register_address_increment(_regs[VCPU_REGS_RSP], op_bytes); - no_wb = 1; /* Disable writeback. */ - break; } goto writeback; twobyte_insn: - switch (b) { + switch (c->b) { case 0x01: /* lgdt, lidt, lmsw */ - /* Disable writeback. 
*/ - no_wb = 1; - switch (modrm_reg) { + switch (c->modrm_reg) { u16 size; unsigned long address; - case 2: /* lgdt */ - rc = read_descriptor(ctxt, ops, src.ptr, - &size, &address, op_bytes); + case 0: /* vmcall */ + if (c->modrm_mod != 3 || c->modrm_rm != 1) + goto cannot_emulate; + + rc = kvm_fix_hypercall(ctxt->vcpu); if (rc) goto done; - realmode_lgdt(ctxt->vcpu, size, address); + + kvm_emulate_hypercall(ctxt->vcpu); break; - case 3: /* lidt */ - rc = read_descriptor(ctxt, ops, src.ptr, - &size, &address, op_bytes); + case 2: /* lgdt */ + rc = read_descriptor(ctxt, ops, c->src.ptr, + &size, &address, c->op_bytes); if (rc) goto done; - realmode_lidt(ctxt->vcpu, size, address); + realmode_lgdt(ctxt->vcpu, size, address); + break; + case 3: /* lidt/vmmcall */ + if (c->modrm_mod == 3 && c->modrm_rm == 1) { + rc = kvm_fix_hypercall(ctxt->vcpu); + if (rc) + goto done; + kvm_emulate_hypercall(ctxt->vcpu); + } else { + rc = read_descriptor(ctxt, ops, c->src.ptr, + &size, &address, + c->op_bytes); + if (rc) + goto done; + realmode_lidt(ctxt->vcpu, size, address); + } break; case 4: /* smsw */ - if (modrm_mod != 3) + if (c->modrm_mod != 3) goto cannot_emulate; - *(u16 *)&_regs[modrm_rm] + *(u16 *)&c->regs[c->modrm_rm] = realmode_get_cr(ctxt->vcpu, 0); break; case 6: /* lmsw */ - if (modrm_mod != 3) + if (c->modrm_mod != 3) goto cannot_emulate; - realmode_lmsw(ctxt->vcpu, (u16)modrm_val, &_eflags); + realmode_lmsw(ctxt->vcpu, (u16)c->modrm_val, + &ctxt->eflags); break; case 7: /* invlpg*/ emulate_invlpg(ctxt->vcpu, cr2); @@ -1215,101 +1598,74 @@ twobyte_insn: default: goto cannot_emulate; } + /* Disable writeback. */ + c->dst.type = OP_NONE; break; case 0x21: /* mov from dr to reg */ - no_wb = 1; - if (modrm_mod != 3) + if (c->modrm_mod != 3) + goto cannot_emulate; + rc = emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm]); + if (rc) goto cannot_emulate; - rc = emulator_get_dr(ctxt, modrm_reg, &_regs[modrm_rm]); + c->dst.type = OP_NONE; /* no writeback */ break; case 0x23: /* mov from reg to dr */ - no_wb = 1; - if (modrm_mod != 3) + if (c->modrm_mod != 3) goto cannot_emulate; - rc = emulator_set_dr(ctxt, modrm_reg, _regs[modrm_rm]); + rc = emulator_set_dr(ctxt, c->modrm_reg, + c->regs[c->modrm_rm]); + if (rc) + goto cannot_emulate; + c->dst.type = OP_NONE; /* no writeback */ break; case 0x40 ... 0x4f: /* cmov */ - dst.val = dst.orig_val = src.val; - d &= ~Mov; /* default to no move */ - /* - * First, assume we're decoding an even cmov opcode - * (lsb == 0). - */ - switch ((b & 15) >> 1) { - case 0: /* cmovo */ - d |= (_eflags & EFLG_OF) ? Mov : 0; - break; - case 1: /* cmovb/cmovc/cmovnae */ - d |= (_eflags & EFLG_CF) ? Mov : 0; - break; - case 2: /* cmovz/cmove */ - d |= (_eflags & EFLG_ZF) ? Mov : 0; - break; - case 3: /* cmovbe/cmovna */ - d |= (_eflags & (EFLG_CF | EFLG_ZF)) ? Mov : 0; - break; - case 4: /* cmovs */ - d |= (_eflags & EFLG_SF) ? Mov : 0; - break; - case 5: /* cmovp/cmovpe */ - d |= (_eflags & EFLG_PF) ? Mov : 0; - break; - case 7: /* cmovle/cmovng */ - d |= (_eflags & EFLG_ZF) ? Mov : 0; - /* fall through */ - case 6: /* cmovl/cmovnge */ - d |= (!(_eflags & EFLG_SF) != - !(_eflags & EFLG_OF)) ? Mov : 0; - break; - } - /* Odd cmov opcodes (lsb == 1) have inverted sense. */ - d ^= (b & 1) ? 
Mov : 0; + c->dst.val = c->dst.orig_val = c->src.val; + if (!test_cc(c->b, ctxt->eflags)) + c->dst.type = OP_NONE; /* no writeback */ + break; + case 0xa3: + bt: /* bt */ + c->dst.type = OP_NONE; + /* only subword offset */ + c->src.val &= (c->dst.bytes << 3) - 1; + emulate_2op_SrcV_nobyte("bt", c->src, c->dst, ctxt->eflags); + break; + case 0xab: + bts: /* bts */ + /* only subword offset */ + c->src.val &= (c->dst.bytes << 3) - 1; + emulate_2op_SrcV_nobyte("bts", c->src, c->dst, ctxt->eflags); break; case 0xb0 ... 0xb1: /* cmpxchg */ /* * Save real source value, then compare EAX against * destination. */ - src.orig_val = src.val; - src.val = _regs[VCPU_REGS_RAX]; - emulate_2op_SrcV("cmp", src, dst, _eflags); - /* Always write back. The question is: where to? */ - d |= Mov; - if (_eflags & EFLG_ZF) { + c->src.orig_val = c->src.val; + c->src.val = c->regs[VCPU_REGS_RAX]; + emulate_2op_SrcV("cmp", c->src, c->dst, ctxt->eflags); + if (ctxt->eflags & EFLG_ZF) { /* Success: write back to memory. */ - dst.val = src.orig_val; + c->dst.val = c->src.orig_val; } else { /* Failure: write the value we saw to EAX. */ - dst.type = OP_REG; - dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX]; + c->dst.type = OP_REG; + c->dst.ptr = (unsigned long *)&c->regs[VCPU_REGS_RAX]; } break; - case 0xa3: - bt: /* bt */ - src.val &= (dst.bytes << 3) - 1; /* only subword offset */ - emulate_2op_SrcV_nobyte("bt", src, dst, _eflags); - break; case 0xb3: btr: /* btr */ - src.val &= (dst.bytes << 3) - 1; /* only subword offset */ - emulate_2op_SrcV_nobyte("btr", src, dst, _eflags); - break; - case 0xab: - bts: /* bts */ - src.val &= (dst.bytes << 3) - 1; /* only subword offset */ - emulate_2op_SrcV_nobyte("bts", src, dst, _eflags); + /* only subword offset */ + c->src.val &= (c->dst.bytes << 3) - 1; + emulate_2op_SrcV_nobyte("btr", c->src, c->dst, ctxt->eflags); break; case 0xb6 ... 0xb7: /* movzx */ - dst.bytes = op_bytes; - dst.val = (d & ByteOp) ? (u8) src.val : (u16) src.val; - break; - case 0xbb: - btc: /* btc */ - src.val &= (dst.bytes << 3) - 1; /* only subword offset */ - emulate_2op_SrcV_nobyte("btc", src, dst, _eflags); + c->dst.bytes = c->op_bytes; + c->dst.val = (c->d & ByteOp) ? (u8) c->src.val + : (u16) c->src.val; break; case 0xba: /* Grp8 */ - switch (modrm_reg & 3) { + switch (c->modrm_reg & 3) { case 0: goto bt; case 1: @@ -1320,121 +1676,97 @@ twobyte_insn: goto btc; } break; + case 0xbb: + btc: /* btc */ + /* only subword offset */ + c->src.val &= (c->dst.bytes << 3) - 1; + emulate_2op_SrcV_nobyte("btc", c->src, c->dst, ctxt->eflags); + break; case 0xbe ... 0xbf: /* movsx */ - dst.bytes = op_bytes; - dst.val = (d & ByteOp) ? (s8) src.val : (s16) src.val; + c->dst.bytes = c->op_bytes; + c->dst.val = (c->d & ByteOp) ? (s8) c->src.val : + (s16) c->src.val; break; } goto writeback; twobyte_special_insn: - /* Disable writeback. 
*/ - no_wb = 1; - switch (b) { + switch (c->b) { + case 0x06: + emulate_clts(ctxt->vcpu); + break; case 0x09: /* wbinvd */ break; case 0x0d: /* GrpP (prefetch) */ case 0x18: /* Grp16 (prefetch/nop) */ break; - case 0x06: - emulate_clts(ctxt->vcpu); - break; case 0x20: /* mov cr, reg */ - if (modrm_mod != 3) + if (c->modrm_mod != 3) goto cannot_emulate; - _regs[modrm_rm] = realmode_get_cr(ctxt->vcpu, modrm_reg); + c->regs[c->modrm_rm] = + realmode_get_cr(ctxt->vcpu, c->modrm_reg); break; case 0x22: /* mov reg, cr */ - if (modrm_mod != 3) + if (c->modrm_mod != 3) goto cannot_emulate; - realmode_set_cr(ctxt->vcpu, modrm_reg, modrm_val, &_eflags); + realmode_set_cr(ctxt->vcpu, + c->modrm_reg, c->modrm_val, &ctxt->eflags); break; case 0x30: /* wrmsr */ - msr_data = (u32)_regs[VCPU_REGS_RAX] - | ((u64)_regs[VCPU_REGS_RDX] << 32); - rc = kvm_set_msr(ctxt->vcpu, _regs[VCPU_REGS_RCX], msr_data); + msr_data = (u32)c->regs[VCPU_REGS_RAX] + | ((u64)c->regs[VCPU_REGS_RDX] << 32); + rc = kvm_set_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], msr_data); if (rc) { - kvm_arch_ops->inject_gp(ctxt->vcpu, 0); - _eip = ctxt->vcpu->rip; + kvm_x86_ops->inject_gp(ctxt->vcpu, 0); + c->eip = ctxt->vcpu->rip; } rc = X86EMUL_CONTINUE; break; case 0x32: /* rdmsr */ - rc = kvm_get_msr(ctxt->vcpu, _regs[VCPU_REGS_RCX], &msr_data); + rc = kvm_get_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], &msr_data); if (rc) { - kvm_arch_ops->inject_gp(ctxt->vcpu, 0); - _eip = ctxt->vcpu->rip; + kvm_x86_ops->inject_gp(ctxt->vcpu, 0); + c->eip = ctxt->vcpu->rip; } else { - _regs[VCPU_REGS_RAX] = (u32)msr_data; - _regs[VCPU_REGS_RDX] = msr_data >> 32; + c->regs[VCPU_REGS_RAX] = (u32)msr_data; + c->regs[VCPU_REGS_RDX] = msr_data >> 32; } rc = X86EMUL_CONTINUE; break; - case 0xc7: /* Grp9 (cmpxchg8b) */ - { - u64 old, new; - if ((rc = ops->read_emulated(cr2, &old, 8, ctxt)) != 0) - goto done; - if (((u32) (old >> 0) != (u32) _regs[VCPU_REGS_RAX]) || - ((u32) (old >> 32) != (u32) _regs[VCPU_REGS_RDX])) { - _regs[VCPU_REGS_RAX] = (u32) (old >> 0); - _regs[VCPU_REGS_RDX] = (u32) (old >> 32); - _eflags &= ~EFLG_ZF; - } else { - new = ((u64)_regs[VCPU_REGS_RCX] << 32) - | (u32) _regs[VCPU_REGS_RBX]; - if ((rc = ops->cmpxchg_emulated(cr2, &old, - &new, 8, ctxt)) != 0) - goto done; - _eflags |= EFLG_ZF; - } + case 0x80 ... 0x8f: /* jnz rel, etc*/ { + long int rel; + + switch (c->op_bytes) { + case 2: + rel = insn_fetch(s16, 2, c->eip); break; + case 4: + rel = insn_fetch(s32, 4, c->eip); + break; + case 8: + rel = insn_fetch(s64, 8, c->eip); + break; + default: + DPRINTF("jnz: Invalid op_bytes\n"); + goto cannot_emulate; } + if (test_cc(c->b, ctxt->eflags)) + JMP_REL(rel); + break; + } + case 0xc7: /* Grp9 (cmpxchg8b) */ + rc = emulate_grp9(ctxt, ops, cr2); + if (rc != 0) + goto done; + break; } + /* Disable writeback. 
*/ + c->dst.type = OP_NONE; goto writeback; cannot_emulate: - DPRINTF("Cannot emulate %02x\n", b); + DPRINTF("Cannot emulate %02x\n", c->b); return -1; } - -#ifdef __XEN__ - -#include -#include - -int -x86_emulate_read_std(unsigned long addr, - unsigned long *val, - unsigned int bytes, struct x86_emulate_ctxt *ctxt) -{ - unsigned int rc; - - *val = 0; - - if ((rc = copy_from_user((void *)val, (void *)addr, bytes)) != 0) { - propagate_page_fault(addr + bytes - rc, 0); /* read fault */ - return X86EMUL_PROPAGATE_FAULT; - } - - return X86EMUL_CONTINUE; -} - -int -x86_emulate_write_std(unsigned long addr, - unsigned long val, - unsigned int bytes, struct x86_emulate_ctxt *ctxt) -{ - unsigned int rc; - - if ((rc = copy_to_user((void *)addr, (void *)&val, bytes)) != 0) { - propagate_page_fault(addr + bytes - rc, PGERR_write_access); - return X86EMUL_PROPAGATE_FAULT; - } - - return X86EMUL_CONTINUE; -} - -#endif diff --git a/drivers/kvm/x86_emulate.h b/drivers/kvm/x86_emulate.h index ea3407d..f03b128 100644 --- a/drivers/kvm/x86_emulate.h +++ b/drivers/kvm/x86_emulate.h @@ -60,7 +60,7 @@ struct x86_emulate_ops { * @bytes: [IN ] Number of bytes to read from memory. */ int (*read_std)(unsigned long addr, void *val, - unsigned int bytes, struct x86_emulate_ctxt * ctxt); + unsigned int bytes, struct kvm_vcpu *vcpu); /* * write_std: Write bytes of standard (non-emulated/special) memory. @@ -71,7 +71,7 @@ struct x86_emulate_ops { * @bytes: [IN ] Number of bytes to write to memory. */ int (*write_std)(unsigned long addr, const void *val, - unsigned int bytes, struct x86_emulate_ctxt * ctxt); + unsigned int bytes, struct kvm_vcpu *vcpu); /* * read_emulated: Read bytes from emulated/special memory area. @@ -82,7 +82,7 @@ struct x86_emulate_ops { int (*read_emulated) (unsigned long addr, void *val, unsigned int bytes, - struct x86_emulate_ctxt * ctxt); + struct kvm_vcpu *vcpu); /* * write_emulated: Write bytes to emulated/special memory area. @@ -94,7 +94,7 @@ struct x86_emulate_ops { int (*write_emulated) (unsigned long addr, const void *val, unsigned int bytes, - struct x86_emulate_ctxt * ctxt); + struct kvm_vcpu *vcpu); /* * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an @@ -108,11 +108,39 @@ struct x86_emulate_ops { const void *old, const void *new, unsigned int bytes, - struct x86_emulate_ctxt * ctxt); + struct kvm_vcpu *vcpu); }; -struct cpu_user_regs; +/* Type, address-of, and value of an instruction's operand. */ +struct operand { + enum { OP_REG, OP_MEM, OP_IMM, OP_NONE } type; + unsigned int bytes; + unsigned long val, orig_val, *ptr; +}; + +struct decode_cache { + u8 twobyte; + u8 b; + u8 lock_prefix; + u8 rep_prefix; + u8 op_bytes; + u8 ad_bytes; + struct operand src; + struct operand dst; + unsigned long *override_base; + unsigned int d; + unsigned long regs[NR_VCPU_REGS]; + unsigned long eip; + /* modrm */ + u8 modrm; + u8 modrm_mod; + u8 modrm_reg; + u8 modrm_rm; + u8 use_modrm_ea; + unsigned long modrm_ea; + unsigned long modrm_val; +}; struct x86_emulate_ctxt { /* Register state before/after emulation. */ @@ -131,6 +159,10 @@ struct x86_emulate_ctxt { unsigned long ss_base; unsigned long gs_base; unsigned long fs_base; + + /* decode cache */ + + struct decode_cache decode; }; /* Execution mode, passed to the emulator. */ @@ -146,20 +178,9 @@ #elif defined(CONFIG_X86_64) #define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64 #endif -/* - * x86_emulate_memop: Emulate an instruction that faulted attempting to - * read/write a 'special' memory area. 
- * Returns -1 on failure, 0 on success. - */ -int x86_emulate_memop(struct x86_emulate_ctxt *ctxt, - struct x86_emulate_ops *ops); - -/* - * Given the 'reg' portion of a ModRM byte, and a register block, return a - * pointer into the block that addresses the relevant register. - * @highbyte_regs specifies whether to decode AH,CH,DH,BH. - */ -void *decode_register(u8 modrm_reg, unsigned long *regs, - int highbyte_regs); +int x86_decode_insn(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops); +int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops); #endif /* __X86_EMULATE_H__ */ diff --git a/include/asm-i386/processor-flags.h b/include/asm-i386/processor-flags.h index 5404e90..199cab1 100644 --- a/include/asm-i386/processor-flags.h +++ b/include/asm-i386/processor-flags.h @@ -63,7 +63,7 @@ #define X86_CR4_VMXE 0x00002000 /* enabl /* * x86-64 Task Priority Register, CR8 */ -#define X86_CR8_TPR 0x00000007 /* task priority register */ +#define X86_CR8_TPR 0x0000000F /* task priority register */ /* * AMD and Transmeta use MSRs for configuration; see diff --git a/include/linux/kvm.h b/include/linux/kvm.h index e6edca8..30a8369 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -4,8 +4,7 @@ #define __LINUX_KVM_H /* * Userspace interface for /dev/kvm - kernel based virtual machine * - * Note: this interface is considered experimental and may change without - * notice. + * Note: you must update KVM_API_VERSION if you change this interface. */ #include @@ -13,14 +12,8 @@ #include #define KVM_API_VERSION 12 -/* - * Architectural interrupt line count, and the size of the bitmap needed - * to hold them. - */ +/* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 -#define KVM_IRQ_BITMAP_SIZE_BYTES ((KVM_NR_INTERRUPTS + 7) / 8) -#define KVM_IRQ_BITMAP_SIZE(type) (KVM_IRQ_BITMAP_SIZE_BYTES / sizeof(type)) - /* for KVM_CREATE_MEMORY_REGION */ struct kvm_memory_region { @@ -41,6 +34,78 @@ struct kvm_memory_alias { __u64 target_phys_addr; }; +/* for KVM_IRQ_LINE */ +struct kvm_irq_level { + /* + * ACPI gsi notion of irq. + * For IA-64 (APIC model) IOAPIC0: irq 0-23; IOAPIC1: irq 24-47.. + * For X86 (standard AT mode) PIC0/1: irq 0-15. IOAPIC0: 0-23.. 
+ */ + __u32 irq; + __u32 level; +}; + +/* for KVM_GET_IRQCHIP and KVM_SET_IRQCHIP */ +struct kvm_pic_state { + __u8 last_irr; /* edge detection */ + __u8 irr; /* interrupt request register */ + __u8 imr; /* interrupt mask register */ + __u8 isr; /* interrupt service register */ + __u8 priority_add; /* highest irq priority */ + __u8 irq_base; + __u8 read_reg_select; + __u8 poll; + __u8 special_mask; + __u8 init_state; + __u8 auto_eoi; + __u8 rotate_on_auto_eoi; + __u8 special_fully_nested_mode; + __u8 init4; /* true if 4 byte init */ + __u8 elcr; /* PIIX edge/trigger selection */ + __u8 elcr_mask; +}; + +#define KVM_IOAPIC_NUM_PINS 24 +struct kvm_ioapic_state { + __u64 base_address; + __u32 ioregsel; + __u32 id; + __u32 irr; + __u32 pad; + union { + __u64 bits; + struct { + __u8 vector; + __u8 delivery_mode:3; + __u8 dest_mode:1; + __u8 delivery_status:1; + __u8 polarity:1; + __u8 remote_irr:1; + __u8 trig_mode:1; + __u8 mask:1; + __u8 reserve:7; + __u8 reserved[4]; + __u8 dest_id; + } fields; + } redirtbl[KVM_IOAPIC_NUM_PINS]; +}; + +enum kvm_irqchip_id { + KVM_IRQCHIP_PIC_MASTER = 0, + KVM_IRQCHIP_PIC_SLAVE = 1, + KVM_IRQCHIP_IOAPIC = 2, +}; + +struct kvm_irqchip { + __u32 chip_id; + __u32 pad; + union { + char dummy[512]; /* reserving space */ + struct kvm_pic_state pic; + struct kvm_ioapic_state ioapic; + } chip; +}; + enum kvm_exit_reason { KVM_EXIT_UNKNOWN = 0, KVM_EXIT_EXCEPTION = 1, @@ -53,6 +118,7 @@ enum kvm_exit_reason { KVM_EXIT_SHUTDOWN = 8, KVM_EXIT_FAIL_ENTRY = 9, KVM_EXIT_INTR = 10, + KVM_EXIT_SET_TPR = 11 }; /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */ @@ -106,11 +172,14 @@ #define KVM_EXIT_IO_OUT 1 } mmio; /* KVM_EXIT_HYPERCALL */ struct { + __u64 nr; __u64 args[6]; __u64 ret; __u32 longmode; __u32 pad; } hypercall; + /* Fix the size of the union. */ + char padding[256]; }; }; @@ -139,6 +208,12 @@ struct kvm_fpu { __u32 pad2; }; +/* for KVM_GET_LAPIC and KVM_SET_LAPIC */ +#define KVM_APIC_REG_SIZE 0x400 +struct kvm_lapic_state { + char regs[KVM_APIC_REG_SIZE]; +}; + struct kvm_segment { __u64 base; __u32 limit; @@ -164,7 +239,7 @@ struct kvm_sregs { __u64 cr0, cr2, cr3, cr4, cr8; __u64 efer; __u64 apic_base; - __u64 interrupt_bitmap[KVM_IRQ_BITMAP_SIZE(__u64)]; + __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64]; }; struct kvm_msr_entry { @@ -272,6 +347,12 @@ #define KVM_CHECK_EXTENSION _IO(KV #define KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO, 0x04) /* in bytes */ /* + * Extension capability list. 
+ */ +#define KVM_CAP_IRQCHIP 0 +#define KVM_CAP_HLT 1 + +/* * ioctls for VM fds */ #define KVM_SET_MEMORY_REGION _IOW(KVMIO, 0x40, struct kvm_memory_region) @@ -282,6 +363,11 @@ #define KVM_SET_MEMORY_REGION _IOW(K #define KVM_CREATE_VCPU _IO(KVMIO, 0x41) #define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log) #define KVM_SET_MEMORY_ALIAS _IOW(KVMIO, 0x43, struct kvm_memory_alias) +/* Device model IOC */ +#define KVM_CREATE_IRQCHIP _IO(KVMIO, 0x60) +#define KVM_IRQ_LINE _IOW(KVMIO, 0x61, struct kvm_irq_level) +#define KVM_GET_IRQCHIP _IOWR(KVMIO, 0x62, struct kvm_irqchip) +#define KVM_SET_IRQCHIP _IOR(KVMIO, 0x63, struct kvm_irqchip) /* * ioctls for vcpu fds @@ -300,5 +386,7 @@ #define KVM_SET_CPUID _IOW(K #define KVM_SET_SIGNAL_MASK _IOW(KVMIO, 0x8b, struct kvm_signal_mask) #define KVM_GET_FPU _IOR(KVMIO, 0x8c, struct kvm_fpu) #define KVM_SET_FPU _IOW(KVMIO, 0x8d, struct kvm_fpu) +#define KVM_GET_LAPIC _IOR(KVMIO, 0x8e, struct kvm_lapic_state) +#define KVM_SET_LAPIC _IOW(KVMIO, 0x8f, struct kvm_lapic_state) #endif diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index 3b29256..cc5dfb4 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -1,73 +1,110 @@ #ifndef __LINUX_KVM_PARA_H #define __LINUX_KVM_PARA_H -/* - * Guest OS interface for KVM paravirtualization - * - * Note: this interface is totally experimental, and is certain to change - * as we make progress. +/* This CPUID returns the signature 'KVMKVMKVM' in ebx, ecx, and edx. It + * should be used to determine that a VM is running under KVM. */ +#define KVM_CPUID_SIGNATURE 0x40000000 -/* - * Per-VCPU descriptor area shared between guest and host. Writable to - * both guest and host. Registered with the host by the guest when - * a guest acknowledges paravirtual mode. - * - * NOTE: all addresses are guest-physical addresses (gpa), to make it - * easier for the hypervisor to map between the various addresses. - */ -struct kvm_vcpu_para_state { - /* - * API version information for compatibility. If there's any support - * mismatch (too old host trying to execute too new guest) then - * the host will deny entry into paravirtual mode. Any other - * combination (new host + old guest and new host + new guest) - * is supposed to work - new host versions will support all old - * guest API versions. - */ - u32 guest_version; - u32 host_version; - u32 size; - u32 ret; - - /* - * The address of the vm exit instruction (VMCALL or VMMCALL), - * which the host will patch according to the CPU model the - * VM runs on: - */ - u64 hypercall_gpa; - -} __attribute__ ((aligned(PAGE_SIZE))); - -#define KVM_PARA_API_VERSION 1 - -/* - * This is used for an RDMSR's ECX parameter to probe for a KVM host. - * Hopefully no CPU vendor will use up this number. This is placed well - * out of way of the typical space occupied by CPU vendors' MSR indices, - * and we think (or at least hope) it wont be occupied in the future - * either. +/* This CPUID returns a feature bitmap in eax. Before enabling a particular + * paravirtualization, the appropriate feature bit should be checked. */ -#define MSR_KVM_API_MAGIC 0x87655678 +#define KVM_CPUID_FEATURES 0x40000001 -#define KVM_EINVAL 1 +/* Return values for hypercalls */ +#define KVM_ENOSYS 1000 -/* - * Hypercall calling convention: - * - * Each hypercall may have 0-6 parameters. - * - * 64-bit hypercall index is in RAX, goes from 0 to __NR_hypercalls-1 - * - * 64-bit parameters 1-6 are in the standard gcc x86_64 calling convention - * order: RDI, RSI, RDX, RCX, R8, R9. 
- * - * 32-bit index is EBX, parameters are: EAX, ECX, EDX, ESI, EDI, EBP. - * (the first 3 are according to the gcc regparm calling convention) +#ifdef __KERNEL__ +#include + +/* This instruction is vmcall. On non-VT architectures, it will generate a + * trap that we will then rewrite to the appropriate instruction. + */ +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" + +/* For KVM hypercalls, a three-byte sequence of either the vmcall or the vmmcall + * instruction. The hypervisor may replace it with something else but only the + * instructions are guaranteed to be supported. * - * No registers are clobbered by the hypercall, except that the - * return value is in RAX. + * Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively. + * The hypercall number should be placed in rax and the return value will be + * placed in rax. No other registers will be clobbered unless explicitly + * noted by the particular hypercall. */ -#define __NR_hypercalls 0 + +static inline long kvm_hypercall0(unsigned int nr) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr)); + return ret; +} + +static inline long kvm_hypercall1(unsigned int nr, unsigned long p1) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1)); + return ret; +} + +static inline long kvm_hypercall2(unsigned int nr, unsigned long p1, + unsigned long p2) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1), "c"(p2)); + return ret; +} + +static inline long kvm_hypercall3(unsigned int nr, unsigned long p1, + unsigned long p2, unsigned long p3) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1), "c"(p2), "d"(p3)); + return ret; +} + +static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, + unsigned long p2, unsigned long p3, + unsigned long p4) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)); + return ret; +} + +static inline int kvm_para_available(void) +{ + unsigned int eax, ebx, ecx, edx; + char signature[13]; + + cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx); + memcpy(signature + 0, &ebx, 4); + memcpy(signature + 4, &ecx, 4); + memcpy(signature + 8, &edx, 4); + signature[12] = 0; + + if (strcmp(signature, "KVMKVMKVM") == 0) + return 1; + + return 0; +} + +static inline int kvm_para_has_feature(unsigned int feature) +{ + if (cpuid_eax(KVM_CPUID_FEATURES) & (1UL << feature)) + return 1; + return 0; +} + +#endif #endif
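
For illustration, the kvm_para.h interface above is all a guest needs to probe for KVM and issue a hypercall. A minimal sketch follows, assuming a made-up hypercall number KVM_HC_EXAMPLE (this series defines no hypercall numbers) and assuming the host returns -KVM_ENOSYS for hypercalls it does not implement; it is not part of the patches above.

#include <linux/kvm_para.h>

#define KVM_HC_EXAMPLE 0	/* hypothetical hypercall number, illustration only */

static int example_probe_and_call(void)
{
	long ret;

	/* Look for the "KVMKVMKVM" cpuid signature before hypercalling. */
	if (!kvm_para_available())
		return -1;

	/* Number goes in rax, first argument in rbx; result comes back in rax. */
	ret = kvm_hypercall1(KVM_HC_EXAMPLE, 0UL);
	if (ret == -KVM_ENOSYS)
		return -1;	/* host does not implement this hypercall */

	return (int)ret;
}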