GIT a80408da7a05e0be2ae99ad47dafd4bb4bc847cd git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git#master

commit a80408da7a05e0be2ae99ad47dafd4bb4bc847cd
Author: Avi Kivity
Date:   Tue Jun 5 14:37:09 2007 +0300

    KVM: Enable guest smp

    As we don't support guest tlb shootdown yet, this is only reliable for real-mode guests.

    Signed-off-by: Avi Kivity

commit 80b70c068ce4333e5e1242f32f538835a4e5d896
Author: Avi Kivity
Date:   Tue Jun 5 14:36:10 2007 +0300

    KVM: Fix adding an smp virtual machine to the vm list

    If we add the vm once per vcpu, we corrupt the list if the guest has multiple vcpus.

    Signed-off-by: Avi Kivity

commit 16fb83998b62717831dca3d913455091c855b3cd
Author: Avi Kivity
Date:   Tue Jun 5 12:17:03 2007 +0300

    KVM: Fix vcpu freeing for guest smp

    A vcpu can pin up to four mmu shadow pages, which means the freeing loop will never terminate. Fix by first unpinning shadow pages on all vcpus, then freeing shadow pages.

    Signed-off-by: Avi Kivity

commit 55ae364d6a882c94511db17e8023c8976d44cd2d
Author: Nguyen Anh Quynh
Date:   Tue Jun 5 10:35:19 2007 +0300

    KVM: Remove unnecessary initialization and checks in mark_page_dirty()

    Signed-off-by: Avi Kivity

commit 0ae1aebcc9825fba4d115c197e9c099fd9644caf
Author: Robert P. J. Day
Date:   Sun Jun 3 13:35:29 2007 -0400

    KVM: Replace C code with call to ARRAY_SIZE() macro.

    Signed-off-by: Robert P. J. Day
    Signed-off-by: Avi Kivity

commit 4b82b37a35a085a07d9ed84efee06c69655fd3d1
Author: Avi Kivity
Date:   Mon Jun 4 15:58:30 2007 +0300

    KVM: Lazy guest cr3 switching

    Switching the guest paging context may require us to allocate memory, which might fail. Instead of wiring up error paths everywhere, make context switching lazy and actually do the switch before the next guest entry, where we can return an error if allocation fails.

    Signed-off-by: Avi Kivity

commit fa8cfb020b0ef0acef94ddc9035b932308840314
Author: Avi Kivity
Date:   Mon Jun 4 11:11:23 2007 +0300

    KVM: VMX: Fix asm constraint

    "g" can select a memory location, in which case size information is lost and gas needs an instruction suffix. Since the suffix is different for i386 and x86_64, we simply change the constraint to "r".

    Signed-off-by: Avi Kivity

commit 63275ba244275719d6fd4d77c10d6b15586aa727
Author: Avi Kivity
Date:   Thu May 31 18:28:51 2007 +0300

    KVM: MMU: Remove unused large page marker

    This has not been used for some time, as the same information is available in the page header.

    Signed-off-by: Avi Kivity

commit 21e3670e57c34809d4c141ce1dde4fd8b23a4d60
Author: Avi Kivity
Date:   Thu May 31 18:24:09 2007 +0300

    KVM: MMU: Don't cache guest access bits in the shadow page table

    This was once used to avoid accessing the guest pte when upgrading the shadow pte from read-only to read-write. But usually we need to set the guest pte dirty or accessed bits anyway, so this wasn't really exploited.

    Signed-off-by: Avi Kivity

commit 319d035ef290b510edb7f848d41098c31ceaace0
Author: Avi Kivity
Date:   Thu May 31 18:20:14 2007 +0300

    KVM: MMU: Simplify accessed/dirty/present/nx bit handling

    Always set the accessed and dirty bit (since having them cleared causes a read-modify-write cycle), always set the present bit, and copy the nx bit from the guest.

    Signed-off-by: Avi Kivity
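A sketch of the shadow pte policy the last commit above describes, assuming illustrative mask names rather than the tree's exact definitions:

    /* Illustrative only: compose a shadow pte with present, accessed and
     * dirty pre-set, copying NX straight from the guest pte. */
    u64 make_shadow_pte(u64 gpte, u64 host_pfn)
    {
            u64 spte = PT_PRESENT_MASK      /* always present */
                     | PT_ACCESSED_MASK     /* pre-set: avoids a later */
                     | PT_DIRTY_MASK;       /* read-modify-write cycle */
            spte |= gpte & PT64_NX_MASK;    /* nx copied from the guest */
            spte |= host_pfn << PAGE_SHIFT; /* point at the host page */
            return spte;
    }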
commit 080e7fd753ec60140ea89ebb0ea94625ae541534
Author: Avi Kivity
Date:   Thu May 31 17:17:06 2007 +0300

    KVM: MMU: Remove cr0.wp tricks

    No longer needed as we do everything in one place.

    Signed-off-by: Avi Kivity

commit cc9d465c7a9ef3a109814fa866676f876ff42133
Author: Avi Kivity
Date:   Thu May 31 15:46:04 2007 +0300

    KVM: MMU: Make setting shadow ptes atomic on i386

    Signed-off-by: Avi Kivity

commit 823c30e8740ad71bd9556f3cd235231ad00bfa55
Author: Avi Kivity
Date:   Thu May 31 15:23:35 2007 +0300

    KVM: Make shadow pte updates atomic

    With guest smp, a second vcpu might see partial updates when the first vcpu services a page fault. So delay all updates until we have figured out what the pte should look like. Note that on i386, this is still not completely atomic, as a 64-bit write will be split into two on a 32-bit machine.

    Signed-off-by: Avi Kivity

commit b7bd6888968e797f2deaa4aa9f98466a2371392b
Author: Avi Kivity
Date:   Thu May 31 15:14:09 2007 +0300

    KVM: Move shadow pte modifications from set_pte/set_pde to set_pte_common()

    We want all shadow pte modifications in one place.

    Signed-off-by: Avi Kivity

commit b70ccb0b3fd4ac02c0f6cf5153008c736fa27710
Author: Avi Kivity
Date:   Thu May 31 15:08:29 2007 +0300

    KVM: MMU: Fold fix_write_pf() into set_pte_common()

    This prevents some work from being performed twice, and, more importantly, reduces the number of places where we modify shadow ptes.

    Signed-off-by: Avi Kivity

commit ad5555224aa01b2ddcc45ab9f0172b5497a7cd5d
Author: Avi Kivity
Date:   Thu May 31 11:56:54 2007 +0300

    KVM: MMU: Fold fix_read_pf() into set_pte_common()

    Signed-off-by: Avi Kivity

commit 3f1380d422cbd5b9231c3e997e4cbec000e3a08f
Author: Avi Kivity
Date:   Thu May 31 11:45:18 2007 +0300

    KVM: MMU: Pass the guest pde to set_pte_common

    We will need the accessed bit (in addition to the dirty bit) and also write access (for setting the dirty bit) in a future patch.

    Signed-off-by: Avi Kivity

commit 5fe13ee0e2b404dd34dea17ec0849b4a940a5755
Author: Avi Kivity
Date:   Wed May 30 19:31:17 2007 +0300

    KVM: MMU: Move set_pte_common() to pte width dependent code

    In preparation of some modifications.

    Signed-off-by: Avi Kivity

commit 5ada0f87635fa10a40a22b8b249c3d1fedb79840
Author: Avi Kivity
Date:   Wed May 30 14:21:51 2007 +0300

    KVM: MMU: Simplify fetch() a little bit

    Signed-off-by: Avi Kivity

commit 67310badceaed0519cb8efbe6054d790563ea136
Author: Avi Kivity
Date:   Wed May 30 12:34:53 2007 +0300

    KVM: MMU: Use slab caches for shadow pages and their headers

    Use slab caches instead of a simple custom list.

    Signed-off-by: Avi Kivity

commit 6d9d80f421f77da043b8b6898e01327763adecd2
Author: Eddie Dong
Date:   Tue May 29 15:07:21 2007 +0300

    KVM: Use symbolic constants instead of magic numbers

    Signed-off-by: Avi Kivity

commit 4eaa906699812e2e28c3237cfedd8c21cbd17c4b
Author: Markus Rechberger
Date:   Sun May 27 10:46:52 2007 +0300

    KVM: Fix includes

    KVM compilation fails for some .configs. This fixes it.

    Signed-off-by: Markus Rechberger
    Signed-off-by: Avi Kivity

commit d67c455e06a1eaf8ab20b5c4e51f4ae8271b2637
Author: Avi Kivity
Date:   Thu May 24 11:17:33 2007 +0300

    KVM: x86 emulator: implement wbinvd

    Vista seems to trigger it.

    Signed-off-by: Avi Kivity
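On the two "atomic" commits above ("Make setting shadow ptes atomic on i386" and "Make shadow pte updates atomic"), a minimal sketch of the publish step, assuming a cmpxchg64()-style helper; this is not the tree's exact code:

    static void publish_spte(u64 *sptep, u64 new_spte)
    {
    #ifdef CONFIG_X86_64
            *sptep = new_spte;                  /* one naturally atomic store */
    #else
            u64 old;
            do {                                /* a plain 64-bit store would */
                    old = *sptep;               /* split into two 32-bit ones */
            } while (cmpxchg64(sptep, old, new_spte) != old);
    #endif
    }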
commit fc1193d546ec21c279a8e4e3e9eaf999275b2223
Author: Jan Engelhardt
Date:   Wed May 23 14:22:11 2007 -0700

    Use menuconfig objects II - KVM/Virt

    Make a "menuconfig" out of the Kconfig objects "menu, ..., endmenu", so that the user can disable all the options in that menu at once instead of having to disable each option separately.

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Avi Kivity

commit a6935dbdaa7278d5e4a4d7478f29462f2a5db7fe
Author: Avi Kivity
Date:   Mon May 21 09:15:47 2007 +0300

    KVM: VMX: Remove warnings on i386

    Signed-off-by: Avi Kivity

commit 1ab29f3fb765b08e65de563d9053d4d05cc95f52
Author: Eddie Dong
Date:   Mon May 21 07:28:09 2007 +0300

    KVM: VMX: Avoid saving and restoring msr_efer on lightweight vmexit

    MSR_EFER.LME/LMA bits are automatically saved/restored by VMX hardware; KVM only needs to save the NX/SCE bits at the time of a heavyweight VM exit. But clearing the NX bit in the host environment may cause a system hang if the host page table is using EXB bits, thus we leave the NX bits as they are. If host NX=1 and guest NX=0, we can do a guest page table EXB bits check before inserting a shadow pte (though no guest is expected to see this kind of gp fault). If host NX=0, we present no Execute-Disable feature to the guest, thus there is no host NX=0, guest NX=1 combination. This patch reduces raw vmexit time by ~27%.

    Signed-off-by: Yaozu (Eddie) Dong
    Signed-off-by: Avi Kivity

commit 64ce9a0cf0960f9a029e54d1bffc06123d3b5893
Author: Eddie Dong
Date:   Sun May 20 16:28:59 2007 +0300

    KVM: VMX: Fix a typo which mixes X86_64 and CONFIG_X86_64

    This prevents compilation on 64-bits.

    Signed-off-by: Yaozu (Eddie) Dong
    Signed-off-by: Avi Kivity

commit cc1d717e078464a049cf8364417ec44267cd6143
Author: Eddie Dong
Date:   Sun May 20 10:50:08 2007 +0300

    KVM: VMX: Cleanup redundant code in MSR set

    Signed-off-by: Yaozu (Eddie) Dong
    Signed-off-by: Avi Kivity

commit 8bf50c5c6b2af81355412ec1696a7e2c8ad940f2
Author: Daniel Hecken
Date:   Sun May 20 10:32:14 2007 +0300

    KVM: VMX: Compile-fix for 32-bit hosts

    Signed-off-by: Avi Kivity

commit f552bf62c86b383dd74030c5830c8043bf41e0bd
Author: Eddie Dong
Date:   Thu May 17 18:55:15 2007 +0300

    KVM: VMX: Avoid saving and restoring msrs on lightweight vmexit

    In a lightweight exit (where we exit and reenter the guest without scheduling or exiting to userspace in between), we don't need various msrs on the host, and avoiding shuffling them around reduces raw exit time by 8%.

    Signed-off-by: Yaozu (Eddie) Dong
    Signed-off-by: Avi Kivity

commit 8edb11391b763357734cc5fd293d788d8591e6d7
Author: Nitin A Kamble
Date:   Thu May 17 15:50:34 2007 +0300

    KVM: VMX: Handle #SS faults from real mode

    Instructions with the address size override prefix (opcode 0x67) cause a #SS fault with error code 0 in VM86 mode. Forward them to the emulator.

    Signed-Off-By: Nitin A Kamble
    Signed-off-by: Avi Kivity

commit bdf3f418471ba3c65aa78a1943da179d8320fdf8
Author: Avi Kivity
Date:   Mon May 14 20:41:13 2007 +0300

    KVM: VMX: Use local labels in inline assembly

    This makes oprofile dumps and disassembly easier to read.

    Signed-off-by: Avi Kivity

commit ca76d209b88c344fc6a8eac17057c0088a3d6940
Author: Avi Kivity
Date:   Sun May 13 20:18:14 2007 +0300

    KVM: Remove merge artifact

    Signed-off-by: Avi Kivity

commit 52916bb7c142b5cf8a81da225bf51c2ea60c5b49
Author: Avi Kivity
Date:   Tue May 8 11:34:07 2007 +0300

    KVM: Fix vmx I/O bitmap initialization on highmem systems

    kunmap() expects a struct page, not a virtual address. Fixes an oops loading kvm-intel.ko on i386 with CONFIG_HIGHMEM. Thanks to Michael Ivanov for reporting.

    Signed-off-by: Avi Kivity
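The highmem fix above comes down to kunmap()'s signature: it takes the struct page that was passed to kmap(), not the virtual address kmap() returned. Illustrative usage:

    struct page *page = alloc_page(GFP_KERNEL);
    void *va = kmap(page);      /* temporary kernel mapping */
    memset(va, 0, PAGE_SIZE);
    kunmap(page);               /* correct: pass the struct page */
                                /* kunmap(va) is the bug being fixed */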
commit facc2faaf471ca539ddd96fdbdf2e147421468a6
Author: Avi Kivity
Date:   Mon May 7 10:55:37 2007 +0300

    KVM: Avoid corrupting tr in real mode

    The real mode tr needs to be set to a specific tss so that I/O instructions can function. Divert the new tr values to the real mode save area, from where they will be restored on transition to protected mode. This fixes some crashes on reboot when the bios accesses an I/O instruction.

    Signed-off-by: Avi Kivity

commit 05eb943c9b547ecc4de850f04ed4c09356440528
Author: Avi Kivity
Date:   Sun May 6 16:10:01 2007 +0300

    KVM: VMX: Only reload guest msrs if they are already loaded

    If we set an msr via an ioctl() instead of by handling a guest exit, we have the host state loaded, so reloading the msrs would clobber host state instead of guest state. This fixes a host oops (and loss of a cpu) on a guest reboot.

    Signed-off-by: Avi Kivity

commit 242b0f9ae76651226fb42d9ec3ecb1a9d8d7b263
Author: Avi Kivity
Date:   Sun May 6 15:50:58 2007 +0300

    KVM: MMU: Store shadow page tables as kernel virtual addresses, not physical

    Simplifies things a bit.

    Signed-off-by: Avi Kivity

commit 03aeb06a4440265777ae4ed62e8431955cbea865
Author: Avi Kivity
Date:   Sun May 6 15:36:30 2007 +0300

    KVM: MMU: Simplify kvm_mmu_free_page() a tiny bit

    Signed-off-by: Avi Kivity

commit f66b4a983d460d68ef5cc392285190065b0617e5
Author: Matthew Gregan
Date:   Sun May 6 10:59:46 2007 +0300

    KVM: Implement IA32_EBL_CR_POWERON msr

    Attempting to boot the default 'bsd' kernel of OpenBSD 4.1 i386 in a guest fails early in the kernel init inside p3_get_bus_clock while trying to read the IA32_EBL_CR_POWERON MSR. KVM logs an 'unhandled MSR' message and the guest kernel faults. This patch is sufficient to allow OpenBSD to boot, after which it seems to run fine. I'm not sure if this is the correct solution for dealing with this particular MSR, but it works for me.

    Signed-off-by: Matthew Gregan
    Signed-off-by: Avi Kivity

commit 7a57011a5e7c4082fdfd204115a8212298ef723f
Author: Avi Kivity
Date:   Wed May 2 23:06:22 2007 +0300

    KVM: Set cr0.mp for guests

    This allows fwait instructions to be trapped when the guest fpu is not loaded.

    Signed-off-by: Avi Kivity

commit 90fb720a59dafb11d591a8e53a4a65bfa6fcfea9
Author: Avi Kivity
Date:   Wed May 2 22:57:13 2007 +0300

    KVM: Ensure host cr0.ts is saved

    Otherwise, host fpu state may be corrupted after an exit.

    Signed-off-by: Avi Kivity

commit 7616f59b208b088afd85d40aa06ca6d4d4a6ca1a
Author: Avi Kivity
Date:   Wed May 2 20:40:00 2007 +0300

    KVM: Consolidate guest fpu activation and deactivation

    Easier to keep track of where the fpu is this way.

    Signed-off-by: Avi Kivity

commit 7ca14868fd7f3c0dc21450e61cca5b77a47daf0d
Author: Avi Kivity
Date:   Wed May 2 17:57:40 2007 +0300

    KVM: Rationalize exception bitmap usage

    Everyone owns a piece of the exception bitmap, but they happily write to the entire thing like there's no tomorrow. Centralize handling in update_exception_bitmap() and have everyone call that.

    Signed-off-by: Avi Kivity

commit de32f820227fbe3e159ec42ce8fd55057155edca
Author: Avi Kivity
Date:   Wed May 2 17:33:43 2007 +0300

    KVM: Move some more msr mangling into vmx_save_host_state()

    Signed-off-by: Avi Kivity
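A sketch of the centralization that "KVM: Rationalize exception bitmap usage" above describes: one function recomputes the whole bitmap from current vcpu state, so no caller clobbers another's bits. Field and flag names here are illustrative assumptions:

    static void update_exception_bitmap(struct kvm_vcpu *vcpu)
    {
            u32 eb = 1u << PF_VECTOR;       /* page faults: always trapped */

            if (!vcpu->fpu_active)
                    eb |= 1u << NM_VECTOR;  /* lazy fpu: trap #NM */
            if (vcpu->rmode_active)
                    eb = ~0u;               /* emulated real mode: trap all */
            vmcs_write32(EXCEPTION_BITMAP, eb);
    }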
commit fa580ecc53536620546659740ae2dfcea763d17c
Author: Avi Kivity
Date:   Wed May 2 17:30:48 2007 +0300

    KVM: Prevent guest fpu state from leaking into the host

    The lazy fpu changes did not take into account that some vmexit handlers can sleep. Move loading the guest state into the inner loop so that it can be reloaded if necessary, and move loading the host state into vmx_vcpu_put() so it can be performed whenever we relinquish the vcpu.

    Signed-off-by: Avi Kivity

commit bc8dcc2107de0ba8f25fc910c4559ebe3df33045
Author: Avi Kivity
Date:   Wed May 2 16:54:03 2007 +0300

    KVM: Fix potential guest state leak into host

    The lightweight vmexit path avoids saving and reloading certain host state. However, in certain cases lightweight vmexit handling can schedule(), which requires reloading the host state. So we store the host state in the vcpu structure, and reload it if we relinquish the vcpu.

    Signed-off-by: Avi Kivity

commit 11bdaf6e26c0cbabd9b6c8f2e9de60190815d348
Author: Avi Kivity
Date:   Tue May 1 18:24:38 2007 +0300

    KVM: Increase mmu shadow cache to 1024 pages

    This improves kbuild times by about 10%, bringing it within a respectable 25% of native.

    Signed-off-by: Avi Kivity

commit d6540cdffea466f1ee17a52ef530d40577b476b2
Author: Avi Kivity
Date:   Tue May 1 16:53:31 2007 +0300

    KVM: Update shadow pte on write to guest pte

    A typical demand page/copy on write pattern is:

    - page fault on vaddr
    - kvm propagates fault to guest
    - guest handles fault, updates pte
    - kvm traps write, clears shadow pte, resumes guest
    - guest returns to userspace, re-faults on same vaddr
    - kvm installs shadow pte, resumes guest
    - guest continues

    So, three vmexits for a single guest page fault. But if instead of clearing the page table entry, we update it to correspond to the value that the guest has just written, we eliminate the third vmexit. This patch does exactly that, reducing kbuild time by about 10%.

    Signed-off-by: Avi Kivity

commit 807762acc40f7cc16aefcfaef8a596a4af988b20
Author: Avi Kivity
Date:   Tue May 1 16:44:05 2007 +0300

    KVM: MMU: Respect nonpae pagetable quadrant when zapping ptes

    When a guest writes to a page that has an mmu shadow, we have to clear the shadow pte corresponding to the memory location touched by the guest. Now, in nonpae mode, a single guest page may have two or four shadow pages (because a nonpae page maps 4MB or 4GB, whereas the pae shadow maps 2MB or 1GB), so when we look up the page we find up to three additional aliases for the page. Since we _clear_ the shadow pte, it doesn't matter except for a slight performance penalty, but if we want to _update_ the shadow pte instead of clearing it, it is vital that we don't modify the aliases. Fortunately, exactly which page is needed (the "quadrant") is easily computed, and is accessible in the shadow page header. All we need is to ignore shadow pages from the wrong quadrants.

    Signed-off-by: Avi Kivity

commit 4a5c1655c9f6df8c668428d3c5d2ad4f67dce08d
Author: Avi Kivity
Date:   Tue May 1 14:16:52 2007 +0300

    KVM: Unify kvm_mmu_pre_write() and kvm_mmu_post_write()

    Instead of calling two functions and repeating expensive checks, call one function and provide it with before/after information.

    Signed-off-by: Avi Kivity

commit ff31cf26ff8e17c2f7164c39dc03fe309ed36506
Author: Avi Kivity
Date:   Tue May 1 11:32:28 2007 +0300

    KVM: Be more careful restoring fs on lightweight vmexit

    i386 wants fs for accessing the pda even on a lightweight exit, so ensure we can always restore it. This fixes a regression on i386 introduced by the lightweight vmexit patch.

    Signed-off-by: Avi Kivity
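For the quadrant commit above, the arithmetic for the common pte-page case can be sketched as follows (a nonpae guest page holds 1024 4-byte ptes, shadowed by two pae pages of 512 8-byte ptes each); illustrative, not the tree's exact code:

    unsigned pte_quadrant(unsigned long gpa)
    {
            unsigned offset = gpa & (PAGE_SIZE - 1); /* offset of guest pte */
            return offset / (PAGE_SIZE / 2);         /* 0 or 1: which half */
    }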
commit e6d2f6292194c931b2fa11373a66d640245e1b14
Author: Avi Kivity
Date:   Mon Apr 30 17:05:38 2007 +0300

    KVM: Reduce misfirings of the fork detector

    The kvm mmu tries to detect forks by looking for repeated writes to a page table. If it sees a fork, it unshadows the page table so the page table copying can proceed at native speed instead of being emulated. However, the detector also triggered on simple demand paging access patterns: a linear walk of memory would of course cause repeated writes to the same pagetable page, causing it to unshadow prematurely. Fix by resetting the fork detector if we detect a demand fault.

    Signed-off-by: Avi Kivity

commit f908e27039ab637013ad17c64e4ef77c4c0a24b8
Author: Avi Kivity
Date:   Mon Apr 30 16:15:58 2007 +0300

    KVM: Unindent some code

    Signed-off-by: Avi Kivity

commit 5cf48c367dec74ba8553c53ed332cd075fa38b88
Author: Avi Kivity
Date:   Mon Apr 30 16:07:54 2007 +0300

    KVM: Avoid saving and restoring some host CPU state on lightweight vmexit

    Many msrs and the like will only be used by the host if we schedule() or return to userspace. Therefore, we avoid saving them if we handle the exit within the kernel, and if a reschedule is not requested. Based on a patch from Eddie Dong with a couple of fixes by me.

    Signed-off-by: Yaozu(Eddie) Dong
    Signed-off-by: Avi Kivity

commit 2d8d6944a2249f642420bbc70b199182c70ebc9a
Author: Avi Kivity
Date:   Mon Apr 30 14:47:02 2007 +0300

    KVM: Assume that writes smaller than 4 bytes are to non-pagetable pages

    This allows us to remove write protection earlier than otherwise. Should some mad OS choose to use byte writes to update pagetables, it will suffer a performance hit, but still work correctly.

    Signed-off-by: Avi Kivity

commit 7d0e7eed6200c54462e884abc8dd6681df2f5e7d
Author: Avi Kivity
Date:   Mon Apr 30 12:42:43 2007 +0300

    KVM: Fix RMW mmio handling

    Commit 9bf671a47ed6af3164524a31dbef9360f1b66fb5 optimized the mmio read path by returning to the emulator directly after an mmio read request. But we may also need to return to userspace in case the instruction was a read-modify-write instruction, which means we need to issue a write after completion of the read instead of returning to the guest.

    Signed-off-by: Avi Kivity

commit f05f41f9bb1cf72a13caf61c2931dbbf4bff51eb
Author: Anthony Liguori
Date:   Mon Apr 30 09:48:11 2007 +0300

    KVM: SVM: Allow direct guest access to PC debug port

    The PC debug port is used for IO delay and does not require emulation.

    Signed-off-by: Anthony Liguori
    Signed-off-by: Avi Kivity

commit 99c7b51d71c0b0062b752c5f0a4b3498d3d165db
Author: He, Qing
Date:   Mon Apr 30 09:45:24 2007 +0300

    KVM: VMX: Enable io bitmaps to avoid IO port 0x80 VMEXITs

    This patch enables IO bitmap control on vmx and unmasks port 0x80, to avoid VMEXITs caused by accessing it. 0x80 is used for delays (see include/asm/io.h), and handling VMEXITs on its access is unnecessary but slows things down. This patch improves the kernel build test by around 3%~5%. Because every VM uses the same io bitmap, it is shared between all VMs rather than being a per-VM data structure.

    Signed-off-by: Qing He
    Signed-off-by: Avi Kivity

commit c06d7c14c006c5e2dcd2a7d84603b51e9e60d7a7
Author: Avi Kivity
Date:   Sun Apr 29 16:25:49 2007 +0300

    KVM: Remove unused 'instruction_length'

    As we no longer emulate in userspace, this is meaningless. We don't compute it on SVM anyway.

    Signed-off-by: Avi Kivity
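The io bitmap setup described in the port 0x80 commit above reduces to two lines; bitmap A covers ports 0x0000-0x7fff, and this is a sketch rather than the exact source:

    static void setup_io_bitmap(unsigned long *io_bitmap_a)
    {
            memset(io_bitmap_a, 0xff, PAGE_SIZE);  /* 1 = trap the port */
            clear_bit(0x80, io_bitmap_a);          /* io-delay port passes */
    }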
commit 20426d1309353b3e2771f9c7f534e01ce7a019f2
Author: Avi Kivity
Date:   Sun Apr 29 15:02:17 2007 +0300

    KVM: Don't require explicit indication of completion of mmio or pio

    It is illegal to return from a pio or mmio request without completing it, as mmio or pio is an atomic operation. Therefore, we can simplify the userspace interface by avoiding the completion indication.

    Signed-off-by: Avi Kivity

commit 9bf671a47ed6af3164524a31dbef9360f1b66fb5
Author: Avi Kivity
Date:   Wed Mar 14 15:54:54 2007 +0200

    KVM: Remove extraneous guest entry on mmio read

    When emulating an mmio read, we actually emulate twice: once to determine the physical address of the mmio, and, after we've exited to userspace to get the mmio value, we emulate again to place the value in the result register and update any flags. But we don't really need to enter the guest again for that, only to take an immediate vmexit. So, if we detect that we're doing an mmio read, emulate a single instruction before entering the guest again.

    Signed-off-by: Avi Kivity

commit 8dfdb0d81fb9e858c14e03fd5e007b20167cd065
Author: Avi Kivity
Date:   Sun Apr 29 13:01:34 2007 +0300

    KVM: Remove trailing whitespace

    Signed-off-by: Avi Kivity

commit 1628bcc25417eae4c83ca87e0899c7e02961d975
Author: Anthony Liguori
Date:   Sun Apr 29 11:56:06 2007 +0300

    KVM: SVM: Only save/restore MSRs when needed

    We only have to save/restore MSR_GS_BASE on every VMEXIT. The rest can be saved/restored when we leave the VCPU. Since we don't emulate the DEBUGCTL MSRs and the guest cannot write to them, we don't have to worry about saving/restoring them at all. This shaves a whopping 40% off raw vmexit costs on AMD.

    Signed-off-by: Anthony Liguori
    Signed-off-by: Avi Kivity

commit 68ba823bbe6d546e3ceb63d006c62a84e92837db
Author: Adrian Bunk
Date:   Sat Apr 28 21:20:48 2007 +0200

    KVM: fix an if() condition

    It might have worked in this case since PT_PRESENT_MASK is 1, but let's express this correctly.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Avi Kivity

commit fe7dc1f2c0c3d0c21abf9dfa4387f0b748080688
Author: Anthony Liguori
Date:   Fri Apr 27 09:29:49 2007 +0300

    KVM: VMX: Add lazy FPU support for VT

    Only save/restore the FPU host state when the guest is actually using the FPU.

    Signed-off-by: Anthony Liguori
    Signed-off-by: Avi Kivity

commit 4a579478e5259df8828a8b9e5b3ddac2a946ce88
Author: Anthony Liguori
Date:   Fri Apr 27 09:29:21 2007 +0300

    KVM: VMX: Properly shadow the CR0 register in the vcpu struct

    Set all of the host mask bits for CR0 so that we can maintain a proper shadow of CR0. This exposes CR0.TS, paving the way for lazy fpu handling.

    Signed-off-by: Anthony Liguori
    Signed-off-by: Avi Kivity

commit aad1187a6c0201701026cdb2f7f6eeb49b2af4a2
Author: Avi Kivity
Date:   Wed Apr 25 16:57:46 2007 +0300

    KVM: Move need_resched() check to common code

    Pointed out by Anthony Liguori.

    Signed-off-by: Avi Kivity

commit b08487bd204708241c9b71ebfc555e334a4e4711
Author: Eddie Dong
Date:   Wed Apr 25 16:49:19 2007 +0300

    KVM: VMX: Avoid unnecessary vcpu_load()/vcpu_put() cycles

    By checking if a reschedule is needed, we avoid dropping the vcpu.

    Signed-off-by: Avi Kivity

commit 25900fd20d141145348178ffe91948e47c83e2ab
Author: Avi Kivity
Date:   Wed Apr 25 11:51:06 2007 +0300

    KVM: Avoid unused function warning due to assertion removal

    Signed-off-by: Avi Kivity

commit 2bd9b992631841b1be5883a5c27b9c58ae9bb96a
Author: Avi Kivity
Date:   Wed Apr 25 11:48:45 2007 +0300

    KVM: We want asserts on debug builds, not release

    Noticed by Michael Riepe.

    Signed-off-by: Avi Kivity
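A sketch of the CR0 shadowing from "Properly shadow the CR0 register" above: own every CR0 bit so guest writes trap, and let reads return the value kept in the vcpu struct. The VMCS fields are real; the surrounding usage is illustrative:

    vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);   /* host owns all CR0 bits */
    vmcs_writel(CR0_READ_SHADOW, vcpu->cr0);  /* what guest reads see */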
commit c3efc3ab86aa651106f6302592e25c7ab8285c35
Author: Avi Kivity
Date:   Thu Apr 12 13:03:01 2007 +0300

    KVM: Initialize cr0 to indicate an fpu is present

    Solaris panics if it sees a cpu with no fpu, and it seems to rely on this bit. Closes sourceforge bug 1698920.

    Signed-off-by: Avi Kivity

commit 28b183145d34a8ad1bc462df565165a88bcb5220
Author: Yaozu Dong
Date:   Wed Apr 25 14:17:25 2007 +0800

    KVM: MMU: Avoid heavy ASSERT at non debug mode.

    Signed-off-by: Avi Kivity

commit 418987aef13b475140b76f9f780046d63eb16f86
Author: Avi Kivity
Date:   Wed Apr 25 11:01:28 2007 +0300

    KVM: Document MSR_K6_STAR's special place in the msr index array

    Signed-off-by: Avi Kivity

commit 90ca9e3d54c8b0ac2023c624d1c7260bb8926beb
Author: Avi Kivity
Date:   Wed Apr 25 10:59:52 2007 +0300

    KVM: Don't complain about cpu erratum AA15

    It slows down Windows x64 horribly.

    Signed-off-by: Avi Kivity

commit 6f19cb49965e1316b285a443c9392031b1634f2e
Author: Avi Kivity
Date:   Tue Apr 24 14:13:01 2007 +0300

    KVM: Fix msr-avoidance regression on Core processors

    Core processors don't have the STAR msr, so the attempt not to save it caused an underflow in the number of msrs. Fix by only avoiding the STAR msr if it is actually present.

    Signed-off-by: Avi Kivity

commit ccf9e2f22e5caf6274b5e9aafd9814a32ef049d5
Author: Anthony Liguori
Date:   Mon Apr 23 09:17:21 2007 -0500

    KVM: Lazy FPU support for SVM

    Avoid saving and restoring the guest fpu state on every exit. This shaves ~100 cycles off the guest/host switch.

    Signed-off-by: Anthony Liguori
    Signed-off-by: Avi Kivity

commit d558e0b49319cfc9aa92e9b7215580f265a2ead7
Author: Avi Kivity
Date:   Sun Apr 22 15:28:19 2007 +0300

    KVM: Allow passing 64-bit values to the emulated read/write API

    This simplifies the API somewhat (by eliminating the special-case cmpxchg8b on i386).

    Signed-off-by: Avi Kivity

commit 551284356a39f20de70cd5556e85ae92080aec8c
Author: Avi Kivity
Date:   Fri Apr 20 13:41:09 2007 +0300

    KVM: Silence compile warning on i386

    Signed-off-by: Avi Kivity

commit 459377fe9ba4a307144ead3ad86993cdee9f8fe8
Author: Avi Kivity
Date:   Thu Apr 19 17:27:43 2007 +0300

    KVM: Per-vcpu statistics

    Make the exit statistics per-vcpu instead of global. This gives a 3.5% boost when running one virtual machine per core on my two socket dual core (4 cores total) machine.

    Signed-off-by: Avi Kivity

commit 5c828f83928f186320d74627089122ebc9ea98ce
Author: Avi Kivity
Date:   Thu Apr 19 14:28:44 2007 +0300

    KVM: VMX: Only save/restore MSR_K6_STAR if necessary

    Intel hosts only support syscall/sysret in long mode (and only if efer.sce is enabled), so only reload the related MSR_K6_STAR if the guest will actually be able to use it. This reduces vmexit cost by about 500 cycles (6400 -> 5870) on my setup.

    Signed-off-by: Avi Kivity

commit 37d6247b3636cbf47014694483d2d25c3806e8f2
Author: Avi Kivity
Date:   Thu Apr 19 13:26:39 2007 +0300

    KVM: Fold drivers/kvm/kvm_vmx.h into drivers/kvm/vmx.c

    No meat in that file.

    Signed-off-by: Avi Kivity

commit ba9c2fc1015a2b2f1f930274d465662ed8b860e6
Author: Avi Kivity
Date:   Thu Apr 19 13:22:48 2007 +0300

    KVM: VMX: Don't switch 64-bit msrs for 32-bit guests

    Some msrs are only used by x86_64 instructions, and are therefore not needed when the guest is in legacy mode. By not bothering to switch them, we reduce vmexit latency by 2400 cycles (from about 8800) when running a 32-bit guest on a 64-bit host.

    Signed-off-by: Avi Kivity
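The MSR_K6_STAR policy above can be stated as a predicate; helper and field names are assumptions for the sketch:

    static bool guest_uses_star(struct kvm_vcpu *vcpu)
    {
            /* syscall/sysret work on Intel only in long mode with
             * EFER.SCE set, so switching STAR matters only then */
            return is_long_mode(vcpu) && (vcpu->shadow_efer & EFER_SCE);
    }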
commit 8d6c8a0d891f8c37889f28f368c2621f85e50035
Author: Avi Kivity
Date:   Wed Apr 18 11:18:18 2007 +0300

    KVM: Fix off-by-one when writing to a nonpae guest pde

    Nonpae guest pdes are shadowed by two pae ptes, so we double the offset twice: once to account for the pte size difference, and once because we need to shadow two pdes for a single guest pde. But when writing to the upper guest pde we also need to truncate the lower bits, otherwise the multiply shifts these bits into the pde index and causes an access to the wrong shadow pde. If we're at the end of the page (accessing the very last guest pde) we can even overflow into the next host page and oops.

    Signed-off-by: Avi Kivity

commit f0b9c908fa1451147a07f2f4e4a9409fb7b14160
Author: Avi Kivity
Date:   Tue Apr 17 15:30:24 2007 +0300

    KVM: VMX: Reduce unnecessary saving of host msrs

    The automatically switched msrs are never changed on the host (with the exception of MSR_KERNEL_GS_BASE), and thus there is no need to save them on every vm entry. This reduces vmexit latency by ~400 cycles on i386 and by ~900 cycles (10%) on x86_64.

    Signed-off-by: Avi Kivity

commit 7368e6550cdf72b0ad1b68dbe923f85e37ef4d08
Author: Avi Kivity
Date:   Tue Apr 17 10:53:22 2007 +0300

    KVM: Handle guest page faults when emulating mmio

    Usually, guest page faults are detected by the kvm page fault handler, which detects if they are shadow faults, mmio faults, pagetable faults, or normal guest page faults. However, in certain circumstances, we can detect a page fault much later. One of these events is the following combination:

    - A two memory operand instruction (e.g. movsb) is executed.
    - The first operand is in mmio space (which is the fault reported to kvm)
    - The second operand is in an unmapped address (e.g. a guest page fault)

    The Windows 2000 installer does such an access, and promptly hangs. Fix by adding the missing page fault injection on that path.

    Signed-off-by: Avi Kivity

commit 894f5a5efc0c48482eb10ad48891054a659e5941
Author: Avi Kivity
Date:   Mon Apr 16 14:28:40 2007 +0300

    KVM: SVM: Report hardware exit reason to userspace instead of dmesg

    Signed-off-by: Avi Kivity

commit 94d806a6efd4401ce43358af6a9e8df5a63151ae
Author: Avi Kivity
Date:   Mon Apr 16 13:36:10 2007 +0300

    KVM: Fix pio completion

    Check cur_count instead of count to avoid false completions.

    Signed-off-by: Avi Kivity

commit d3344ae6f6293913d6e4f230ebee0b370f2e3f98
Author: Avi Kivity
Date:   Mon Apr 16 11:53:17 2007 +0300

    KVM: Retry sleeping allocation if atomic allocation fails

    This avoids -ENOMEM under memory pressure.

    Signed-off-by: Avi Kivity

commit 327585c3b4c1d6b04bb752f70f350d98ca855080
Author: Avi Kivity
Date:   Sun Apr 15 16:31:09 2007 +0300

    KVM: Use slab caches to allocate mmu data structures

    Better leak detection, statistics, memory use, speed -- goodness all around.

    Signed-off-by: Avi Kivity

commit 3079541923d2cdf702490eff7081610b7320e37f
Author: Avi Kivity
Date:   Sun Apr 15 15:48:11 2007 +0300

    KVM: Fix string pio when count == 0

    Surprisingly, VT traps when executing a string pio instruction with zero count. Perhaps more surprisingly, the Windows ne2000 driver issues such instructions. Since we aren't prepared to handle completions of these instructions, avoid the entire mess by continuing execution without escaping to userspace. This fixes the networking problems reported by Leslie Mann with recent versions of kvm.

    Signed-off-by: Avi Kivity
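The retry pattern behind "Retry sleeping allocation if atomic allocation fails" above, sketched generically (lock name illustrative):

    void *obj = kmalloc(size, GFP_ATOMIC);      /* fast path, can't sleep */
    if (!obj) {
            spin_unlock(&kvm->lock);
            obj = kmalloc(size, GFP_KERNEL);    /* may sleep, rarely fails */
            spin_lock(&kvm->lock);
            /* state may have changed while unlocked: revalidate */
    }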
commit 3ef1110c81993e01343e1b473f5d7d1a23e6a8a3
Author: Avi Kivity
Date:   Thu Apr 12 17:35:58 2007 +0300

    KVM: Handle partial pae pdptr

    Some guests (Solaris) do not set up all four pdptrs, but leave some invalid. kvm incorrectly treated these as valid page directories, pinning the wrong pages and causing general confusion. Fix by checking the valid bit of a pae pdpte. This closes sourceforge bug 1698922.

    Signed-off-by: Avi Kivity

commit 4e9d9d330d9c9e66c449be10950562e407366a73
Author: Avi Kivity
Date:   Wed Apr 11 19:04:39 2007 +0300

    KVM: Fix memory leak on pio completion

    We get_page() the pages participating in pio before we return to userspace, yet we neglect to free them. This can leak all guest memory in a few seconds by doing a hdparm -d 0 /dev/hda; dd < /dev/hda > /dev/null on the guest.

    Signed-off-by: Avi Kivity

commit b630b9c6819844e29cddcfeaee901f6ada5d571b
Author: Eric Sesterhenn / Snakebyte
Date:   Mon Apr 9 16:15:05 2007 +0200

    KVM: Fix overflow bug in overflow detection code

    The expression sp - 6 < sp, where sp is a u16, is undefined in C since 'sp - 6' is promoted to int, and signed overflow is undefined in C. gcc 4.2 actually warns about it. Replace with a simpler test.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Avi Kivity

commit c338c271f150ab2ded369ef4c1882f85b28af709
Author: Avi Kivity
Date:   Mon Apr 2 13:05:50 2007 +0300

    KVM: Use kernel-standard types

    Noted by Joerg Roedel.

    Signed-off-by: Avi Kivity

commit 0ea6eecef44923d66409a49d71e4fa87fa0f5bed
Author: Avi Kivity
Date:   Sun Apr 1 16:34:31 2007 +0300

    KVM: Add fpu get/set operations

    These are really helpful when migrating a floating point app to another machine.

    Signed-off-by: Avi Kivity

commit 05671a064c73b8cb8966ddd037ece2d6ae2cb75b
Author: Avi Kivity
Date:   Fri Mar 30 16:54:30 2007 +0300

    KVM: Add physical memory aliasing feature

    With this, we can specify that accesses to one physical memory range will be remapped to another. This is useful for the vga window at 0xa0000, which is used as a movable window into the (much larger) framebuffer.

    Signed-off-by: Avi Kivity

commit 8e08039818b6a5b8c81b905f863adaa18d774171
Author: Avi Kivity
Date:   Fri Mar 30 14:02:32 2007 +0300

    KVM: Simplify gfn_to_page()

    Mapping a guest page to a host page is a common operation. Currently, one has first to find the memory slot where the page belongs (gfn_to_memslot), then locate the page itself (gfn_to_page()). This is clumsy, and also won't work well with memory aliases. So simplify gfn_to_page() not to require memory slot translation first, and instead do it internally.

    Signed-off-by: Avi Kivity

commit 66a9932c55ff7240955d57b7d1e62178a9e80868
Author: Dor Laor
Date:   Fri Mar 30 13:06:33 2007 +0300

    Add mmu cache clear function

    Functions that play around with the physical memory map need a way to clear mappings to possibly nonexistent or invalid memory. Both the mmu cache and the processor tlb are cleared.

    Signed-off-by: Dor Laor
    Signed-off-by: Avi Kivity

commit 6095d7b8291fc3e05f3b8790a9bc86b54af281a2
Author: Joerg Roedel
Date:   Fri Mar 30 17:02:14 2007 +0300

    KVM: SVM: enable LBRV virtualization if available

    This patch enables the virtualization of the last branch record MSRs on SVM if this feature is available in hardware. It also introduces a small and simple check feature for specific SVM extensions.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity
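The undefined-behaviour fix above in miniature: with a u16 sp, 'sp - 6' is promoted to int, so comparing it against sp is not a reliable wraparound test. The simpler test compares the stack pointer against the bytes needed (illustrative):

    static int stack_would_wrap(u16 sp, unsigned int need)
    {
            return sp < need;   /* pushing 'need' bytes would wrap sp */
    }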
commit 8f1469e8477bea483d5a6348a30a534449048c8d
Author: Avi Kivity
Date:   Wed Mar 28 20:04:16 2007 +0200

    KVM: x86 emulator: fix bit string operations operand size

    On x86, bit operations operate on a string of bits that can reside in multiple words. For example, 'btsl %eax, (blah)' will touch the word at blah+4 if %eax is between 32 and 63. The x86 emulator compensates for that by advancing the operand address by (bit offset / BITS_PER_LONG) and truncating the bit offset to the range (0..BITS_PER_LONG-1). This has a side effect of forcing the operand size to 8 bytes on 64-bit hosts.

    Now, a 32-bit guest goes and fork()s a process. It write protects a stack page at 0xbffff000 using the 'btr' instruction, at offset 0xffc in the page table, with bit offset 1 (for the write permission bit). The emulator now forces the operand size to 8 bytes as previously described, and an innocent page table update turns into a cross-page-boundary write, which is assumed by the mmu code not to be a page table, so it doesn't actually clear the corresponding shadow page table entry. The guest and host permissions are out of sync, and guest memory is corrupted soon afterwards, leading to guest failure.

    Fix by not using BITS_PER_LONG as the word size; instead use the actual operand size, so we get a 32-bit write in that case. Note we still have to teach the mmu to handle cross-page-boundary writes to guest page tables; but for now this allows Damn Small Linux 0.4 (2.4.20) to boot.

    Signed-off-by: Avi Kivity

commit e3a065c4e99bb8282d72a2c3c75234d7d7408be6
Author: Avi Kivity
Date:   Tue Mar 27 17:50:20 2007 +0200

    KVM: Remove debug message

    No longer interesting.

    Signed-off-by: Avi Kivity

commit 19cd40d605bb99fc9058973a69ef208c8b5b1e42
Author: Avi Kivity
Date:   Tue Mar 27 16:12:41 2007 +0200

    Revert "added KVM_GET_MEM_MAP ioctl to get the memory bitmap for a memory slot"

    This reverts commit ade11a015f83d270d1201c440199146f852fe5e4. As the balloon path will be through qemu, it will have direct knowledge of released gfns, so this API is not directly needed. If it becomes useful in the future, it will be un-reverted.

    Signed-off-by: Avi Kivity

commit 932bf20c0c2075f958bb86b481d8f359197b4d6a
Author: Avi Kivity
Date:   Mon Mar 26 19:31:52 2007 +0200

    KVM: Use list_move()

    Use list_move() where possible. Noticed by Dor Laor.

    Signed-off-by: Avi Kivity

commit 31e82571e8a77d5feb1093627ef0b31f28649590
Author: Michal Piotrowski
Date:   Sun Mar 25 17:59:32 2007 +0200

    KVM: Remove unused function

    Remove unused function:

    CC drivers/kvm/svm.o
    drivers/kvm/svm.c:207: warning: ‘inject_db’ defined but not used

    Signed-off-by: Michal Piotrowski
    Signed-off-by: Avi Kivity

commit 9207113c121519986a114ee5c498184e618ffd68
Author: Avi Kivity
Date:   Sun Mar 25 12:07:27 2007 +0200

    KVM: SVM: Ensure timestamp counter monotonicity

    When a vcpu is migrated from one cpu to another, its timestamp counter may lose its monotonic property if the host has unsynced timestamp counters. This can confuse the guest, sometimes to the point of refusing to boot. As the rdtsc instruction is rather fast on AMD processors (7-10 cycles), we can simply record the last host tsc when we drop the cpu, and adjust the vcpu tsc offset when we detect that we've migrated to a different cpu.

    Signed-off-by: Avi Kivity
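A sketch of the tsc monotonicity scheme the commit above describes: record the host tsc when the vcpu leaves a cpu, and fold the cross-cpu delta into the vmcb tsc offset on arrival. Field names follow the description, not necessarily the exact source:

    static void vcpu_drop_cpu(struct kvm_vcpu *vcpu)
    {
            rdtscll(vcpu->host_tsc);           /* tsc of the old cpu */
    }

    static void vcpu_arrive_cpu(struct kvm_vcpu *vcpu, int cpu)
    {
            u64 tsc_this;

            if (cpu == vcpu->cpu)
                    return;
            rdtscll(tsc_this);                 /* tsc of the new cpu */
            /* keep the guest-visible tsc monotonic across unsynced tscs */
            vcpu->svm->vmcb->control.tsc_offset += vcpu->host_tsc - tsc_this;
    }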
commit b40faf227eb371a52aa21d08f8e9c33fc06602b4
Author: Avi Kivity
Date:   Fri Mar 23 09:55:25 2007 +0200

    KVM: MMU: Fix hugepage pdes mapping same physical address with different access

    The kvm mmu keeps a shadow page for hugepage pdes; if several such pdes map the same physical address, they share the same shadow page. This is a fairly common case (kernel mappings on i386 nonpae Linux, for example). However, if the two pdes map the same memory but with different permissions, kvm will happily use the cached shadow page. If the access through the more permissive pde occurs after the access through the strict pde, an endless pagefault loop will be generated and the guest will make no progress. Fix by making the access permissions part of the cache lookup key. The fix allows Xen pae to boot on kvm and run guest domains. Thanks to Jeremy Fitzhardinge for reporting the bug and testing the fix.

    Signed-off-by: Avi Kivity

commit 061bba1190514205594d2046f5dc31a01a135163
Author: Avi Kivity
Date:   Thu Mar 22 15:10:32 2007 +0200

    Revert "KVM: Remove extraneous guest entry on mmio read"

    This reverts commit b0092d187cfa19dfcada3b85d728af5ae27989dc. While the optimization is sound, it regresses booting the Fedora Core 6 32 bit kernel.

    Signed-off-by: Avi Kivity

commit 4cec1674d1436157c7dcc2b5b6f625b08b2b96e8
Author: Joerg Roedel
Date:   Wed Mar 21 19:47:00 2007 +0100

    KVM: SVM: forbid guest to execute monitor/mwait

    This patch forbids the guest to execute monitor/mwait instructions on SVM. This is necessary because the guest can execute these instructions if they are available, even if the kvm cpuid doesn't report their existence.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity

commit 7921ad9e303f3f03dd81b552e3b0cd87ef355219
Author: Sergey Kiselev
Date:   Thu Mar 22 14:06:18 2007 +0200

    KVM: Handle writes to MCG_STATUS msr

    Some older (~2.6.7) kernels write the MCG_STATUS register during kernel boot (mce_clear_all() function, called from mce_init()). It's not currently handled by kvm, which will cause it to inject a GPF. The following patch adds a "nop" handler for this.

    Signed-off-by: Sergey Kiselev
    Signed-off-by: Avi Kivity

commit 36809e1326c13887d324025d4592958ead8758d5
Author: Avi Kivity
Date:   Wed Mar 21 18:14:42 2007 +0200

    KVM: Remove unused and write-only variables

    Trivial cleanup.

    Signed-off-by: Avi Kivity

commit 262e17b818054dad314a062a439681d79a336d48
Author: Avi Kivity
Date:   Wed Mar 21 18:11:36 2007 +0200

    KVM: Don't allow the guest to turn off the cpu cache

    The cpu cache is a host resource; the guest should not be able to turn it off (even for itself).

    Signed-off-by: Avi Kivity

commit 8c37a70d93ba3e4286ad7524f7915a32ed39cac9
Author: Avi Kivity
Date:   Wed Mar 21 17:58:32 2007 +0200

    KVM: Hack real-mode segments on vmx from KVM_SET_SREGS

    As usual, we need to mangle segment registers when emulating real mode, as vm86 has specific constraints. We special case the reset segment base, and set the "access rights" (or descriptor flags) to vm86 compatible values. This fixes reboot on vmx.

    Signed-off-by: Avi Kivity

commit 0bf8d346418255335dc9062d96b9f8814b471690
Author: Avi Kivity
Date:   Wed Mar 21 13:44:58 2007 +0200

    KVM: Modify guest segments after potentially switching modes

    The SET_SREGS ioctl modifies both cr0.pe (real mode/protected mode) and guest segment registers. Since segment handling is modified by the mode on Intel processors, update the segment registers after the mode switch has taken place.

    Signed-off-by: Avi Kivity

commit f97af70b3aa8a92ddeabb7d42477e7d13dd0a192
Author: Avi Kivity
Date:   Tue Mar 20 18:44:51 2007 +0200

    KVM: Remove set_cr0_no_modeswitch() arch op

    set_cr0_no_modeswitch() was a hack to avoid corrupting segment registers. As we now cache the protected mode values on entry to real mode, this isn't an issue anymore, and it interferes with reboot (which usually _is_ a modeswitch).

    Signed-off-by: Avi Kivity
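A one-line sketch of "Don't allow the guest to turn off the cpu cache" above, with illustrative mask names:

    static unsigned long sanitize_guest_cr0(unsigned long cr0)
    {
            return cr0 & ~(X86_CR0_CD | X86_CR0_NW);  /* cache stays on */
    }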
commit e314dde30e3851e8effc017c6fffced11d90183a
Author: Avi Kivity
Date:   Tue Mar 20 18:40:40 2007 +0200

    KVM: Workaround vmx inability to virtualize the reset state

    The reset state has cs.selector == 0xf000 and cs.base == 0xffff0000, which aren't compatible with vm86 mode, which is used for real mode virtualization. When we create a vcpu, we set cs.base to 0xf0000, but if we get there by way of a reset, the values are inconsistent and vmx refuses to enter guest mode. Work around by detecting the state and munging it appropriately.

    Signed-off-by: Avi Kivity

commit 88aea7ddfae755633b0a80ccfa56244b3c79c7b0
Author: Avi Kivity
Date:   Tue Mar 20 14:34:28 2007 +0200

    KVM: MMU: Remove global pte tracking

    The initial, noncaching, version of the kvm mmu flushed all nonglobal shadow page table translations (much like a native tlb flush). The new implementation flushes translations only when they change, rendering global pte tracking superfluous. This removes the unused tracking mechanism and storage space.

    Signed-off-by: Avi Kivity

commit 66e5d5c81b5b89e39aa86e3bf9864d228f468b0d
Author: Avi Kivity
Date:   Tue Mar 20 14:29:06 2007 +0200

    KVM: MMU: Remove unnecessary check for pdptr access

    We already special case the pdptr access, so no need to check it again.

    Signed-off-by: Avi Kivity

commit c01571ed56754dfea458cc37d553c360082411a1
Author: Avi Kivity
Date:   Tue Mar 20 12:46:50 2007 +0200

    KVM: Avoid guest virtual addresses in string pio userspace interface

    The current string pio interface communicates using guest virtual addresses, relying on userspace to translate addresses and to check permissions. This interface cannot fully support guest smp, as the check needs to take into account two pages at once in case an unaligned string transfer straddles a page boundary. Change the interface not to communicate guest addresses at all; instead use a buffer page (mmaped by userspace) and do transfers there. The kernel manages the virtual to physical translation and can perform the checks atomically by taking the appropriate locks.

    Signed-off-by: Avi Kivity

commit 74c24de6e7848a45d6109d987d4fd2ccd83e432e
Author: Avi Kivity
Date:   Wed Mar 7 13:11:17 2007 +0200

    KVM: Future-proof argument-less ioctls

    Some ioctls ignore their arguments. By requiring them to be zero now, we allow a nonzero value to have some special meaning in the future.

    Signed-off-by: Avi Kivity

commit 29e686a1dc9631b7898d087a0ab1c4716672e209
Author: Avi Kivity
Date:   Wed Mar 7 13:05:38 2007 +0200

    KVM: Allow kernel to select size of mmap() buffer

    This allows us to store offsets in the kernel/user kvm_run area, and be sure that userspace has them mapped. As offsets can be outside the kvm_run struct, userspace has no way of knowing how much to mmap.

    Signed-off-by: Avi Kivity

commit cce3a1062817218c67163732339e2ea25e9f023b
Author: Avi Kivity
Date:   Mon Mar 5 19:46:05 2007 +0200

    KVM: Add guest mode signal mask

    Allow a special signal mask to be used while executing in guest mode. This allows signals to be used to interrupt a vcpu without requiring signal delivery to a userspace handler, which is quite expensive. Userspace still receives -EINTR and can get the signal via sigwait().

    Signed-off-by: Avi Kivity

commit cd3aaa2392baec9674792d71d304ec41e540b517
Author: Avi Kivity
Date:   Mon Mar 5 17:45:40 2007 +0200

    KVM: Initialize the apic_base msr on svm too

    Older userspace didn't care, but newer userspace (with the cpuid changes) does.

    Signed-off-by: Avi Kivity
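A sketch of the userspace pattern the signal-mask commit above enables; kvm_set_sigmask() and kvm_run() are hypothetical wrappers around the corresponding ioctls:

    sigset_t kick;
    sigemptyset(&kick);
    sigaddset(&kick, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &kick, NULL); /* blocked outside guest mode */
    kvm_set_sigmask(vcpu_fd, &kick);         /* deliverable only while running */

    for (;;) {
            if (kvm_run(vcpu_fd) < 0 && errno == EINTR) {
                    int sig;
                    sigwait(&kick, &sig);    /* cheap: no handler involved */
                    continue;                /* service the kick, rerun */
            }
            /* handle the other exit reasons */
    }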
commit c303c0efc5b2ff8c0f77c9079fa66f62801da93d
Author: Avi Kivity
Date:   Sun Mar 4 14:24:03 2007 +0200

    KVM: Add a special exit reason when exiting due to an interrupt

    This is redundant, as we also return -EINTR from the ioctl, but it allows us to examine the exit_reason field on resume without seeing old data.

    Signed-off-by: Avi Kivity

commit 62919332e00e3226dd1f728ff83107d06a6d9a81
Author: Avi Kivity
Date:   Sun Mar 4 14:17:08 2007 +0200

    KVM: Fold kvm_run::exit_type into kvm_run::exit_reason

    Currently, userspace is told about the nature of the last exit from the guest using two fields, exit_type and exit_reason, where exit_type has just two enumerations (and no need for more). So fold exit_type into exit_reason, reducing the complexity of determining what really happened.

    Signed-off-by: Avi Kivity

commit 9e16898f4f5d6cdc35030bb272631611b71548fe
Author: Avi Kivity
Date:   Sun Mar 4 13:59:30 2007 +0200

    KVM: Allow userspace to process hypercalls which have no kernel handler

    This is useful for paravirtualized graphics devices, for example.

    Signed-off-by: Avi Kivity

commit 440fd9098bceb2ca0856d962ff62db9af4d1094a
Author: Avi Kivity
Date:   Thu Mar 1 17:56:20 2007 +0200

    KVM: Add method to check for backwards-compatible API extensions

    Signed-off-by: Avi Kivity

commit 0b37dedb178bcb3b0a28f65e6ae835bf58184301
Author: Avi Kivity
Date:   Thu Mar 1 17:20:13 2007 +0200

    KVM: Renumber ioctls

    The recent changes have left the ioctl numbers in complete disarray.

    Signed-off-by: Avi Kivity

commit 95cab16b18e1c1a786a9fc5ea6fcd68b29ae3481
Author: Avi Kivity
Date:   Thu Mar 1 16:47:06 2007 +0200

    KVM: Remove minor wart from KVM_CREATE_VCPU ioctl

    That ioctl does not transfer any data, so it should be an _IO rather than an _IOW.

    Signed-off-by: Avi Kivity

commit ba5cb15b027b76ba7b4d247914eb6d20065c0767
Author: Avi Kivity
Date:   Thu Mar 1 16:20:40 2007 +0200

    KVM: Remove the 'emulated' field from the userspace interface

    We no longer emulate single instructions in userspace. Instead, we service mmio or pio requests.

    Signed-off-by: Avi Kivity

commit 706e8fe655be36aa686f1fbb398d3a4470d4939b
Author: Avi Kivity
Date:   Wed Feb 28 20:46:53 2007 +0200

    KVM: Handle cpuid in the kernel instead of punting to userspace

    KVM used to handle cpuid by letting userspace decide what values to return to the guest. We now handle cpuid completely in the kernel. We still let userspace decide which values the guest will see by having userspace set up the value table beforehand (this is necessary to allow management software to set the cpu features to the least common denominator, so that live migration can work). The motivation for the change is that kvm kernel code can be impacted by cpuid features, for example the x86 emulator.

    Signed-off-by: Avi Kivity

commit aad2f6e0faf4b03e087bbe6751acdacd72e911b6
Author: Avi Kivity
Date:   Thu Feb 22 19:48:43 2007 +0200

    KVM: Initialize PIO I/O count

    This allows userspace to ignore the io.rep field. Not a big deal, but friendly.

    Signed-off-by: Avi Kivity
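The probe ritual introduced by "Add method to check for backwards-compatible API extensions" above, sketched; the constant name follows what this api became (KVM_CHECK_EXTENSION) and is an assumption for this snapshot of the tree:

    static int kvm_has_extension(int kvm_fd, int cap)
    {
            int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap);
            return r > 0;       /* <= 0 or error: not supported */
    }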
commit e668cf946ee8654c7f5afe3feeed686a3566c22a
Author: Avi Kivity
Date:   Thu Feb 22 19:39:30 2007 +0200

    KVM: Do not communicate to userspace through cpu registers during PIO

    Currently when passing a PIO emulation request to userspace, we rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx (on string instructions). This (a) requires two extra ioctls for getting and setting the registers and (b) is unfriendly to non-x86 archs, when they get kvm ports. So fix by doing the register fixups in the kernel and passing to userspace only an abstract description of the PIO to be done.

    Signed-off-by: Avi Kivity

commit 3de857cd1335bd2e02b60d3a50b7da93ccbabf1d
Author: Avi Kivity
Date:   Thu Feb 22 12:58:31 2007 +0200

    KVM: Use a shared page for kernel/user communication when running a vcpu

    Instead of passing a 'struct kvm_run' back and forth between the kernel and userspace, allocate a page and allow the user to mmap() it. This reduces needless copying and makes the interface expandable by providing lots of free space.

    Signed-off-by: Avi Kivity

commit 128e159e11e999496ec44a549fcac91de3802389
Author: Avi Kivity
Date:   Mon Mar 19 13:18:10 2007 +0200

    KVM: Prevent system selectors leaking into guest on real->protected mode transition on vmx

    Intel virtualization extensions do not support virtualizing real mode. So kvm uses virtualized vm86 mode to run real mode code. Unfortunately, this virtualized vm86 mode does not support the so called "big real" mode, where the segment selector and base do not agree with each other according to the real mode rules (base == selector << 4).

    To work around this, kvm checks whether a selector/base pair violates the virtualized vm86 rules, and if so, forces it into conformance. On a transition back to protected mode, if we see that the guest did not touch a forced segment, we restore it back to the original protected mode value.

    This pile of hacks breaks down if the gdt has changed in real mode, as it can cause a segment selector to point to a system descriptor instead of a normal data segment. In fact, this happens with the Windows bootloader and the qemu acpi bios, where a protected mode memcpy routine issues an innocent 'pop %es' and traps on an attempt to load a system descriptor.

    "Fix" by checking if the to-be-restored selector points at a system segment, and if so, coercing it into a normal data segment. The long term solution, of course, is to abandon vm86 mode and use emulation for big real mode.

    Signed-off-by: Avi Kivity

commit ade11a015f83d270d1201c440199146f852fe5e4
Author: Uri Lublin
Date:   Wed Mar 14 19:21:06 2007 +0200

    added KVM_GET_MEM_MAP ioctl to get the memory bitmap for a memory slot

    To be used when there may be "holes" in the memory. Specifically, to not break VM migration when a ballooning mechanism exists.

    Signed-off-by: Uri Lublin

commit b0092d187cfa19dfcada3b85d728af5ae27989dc
Author: Avi Kivity
Date:   Wed Mar 14 15:54:54 2007 +0200

    KVM: Remove extraneous guest entry on mmio read

    When emulating an mmio read, we actually emulate twice: once to determine the physical address of the mmio, and, after we've exited to userspace to get the mmio value, we emulate again to place the value in the result register and update any flags. But we don't really need to enter the guest again for that, only to take an immediate vmexit. So, if we detect that we're doing an mmio read, emulate a single instruction before entering the guest again.

    Signed-off-by: Avi Kivity

commit 470db88b8b3491199e8d55b771d66e74b2fd53cd
Author: Ingo Molnar
Date:   Sun Mar 11 13:52:33 2007 +0100

    KVM: always reload segment selectors

    A failed VM entry on VMX might still change %fs or %gs, thus make sure that KVM always reloads the segment selectors. This is crucial on both x86 and x86_64: x86 has __KERNEL_PDA in %fs, on which things like 'current' depend, and x86_64 has 0 there and needs MSR_GS_BASE to work.

    Signed-off-by: Ingo Molnar
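The conformance rule from the real->protected commit above, stated as code (virtualized vm86 requires real mode segments to obey base == selector << 4; "big real" mode violates this):

    static int vm86_conformant(u16 selector, unsigned long base)
    {
            return base == (unsigned long)selector << 4;
    }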
commit f7edc6a39584a3f95687a5320675fadb23bccbe5
Author: Ingo Molnar
Date:   Sat Mar 10 11:22:51 2007 +0100

    KVM: trivial whitespace fixes

    Trivial whitespace fixes.

    Signed-off-by: Ingo Molnar

commit f3a33bfeaa5cade1a9ac1facb5cb904a483b1e5c
Author: Avi Kivity
Date:   Fri Mar 9 13:04:31 2007 +0200

    KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram

    PAGE_MASK is an unsigned long, so using it to mask physical addresses on i386 (which are 64-bit wide) leads to truncation. This can result in page->private of unrelated memory pages being modified, with disastrous results. Fix by not using PAGE_MASK for physical addresses; instead calculate the correct value directly from PAGE_SIZE. Also fix a similar BUG_ON().

    Signed-off-by: Avi Kivity

commit 6ee9853b015f8807f497ffad39b142ddc1403aa9
Author: Avi Kivity
Date:   Thu Mar 8 17:13:32 2007 +0200

    KVM: MMU: Fix guest writes to nonpae pde

    KVM shadow page tables are always in pae mode, regardless of the guest setting. This means that a guest pde (mapping 4MB of memory) is mapped to two shadow pdes (mapping 2MB each). When the guest writes to a pte or pde, we intercept the write and emulate it. We also remove any shadowed mappings corresponding to the write. Since the mmu did not account for the doubling in the number of pdes, it removed the wrong entry, resulting in a mismatch between shadow page tables and guest page tables, followed shortly by guest memory corruption. This patch fixes the problem by detecting the special case of writing to a non-pae pde and adjusting the address and number of shadow pdes zapped accordingly.

    Signed-off-by: Avi Kivity

commit 374c1509c7d04a4e351b1812c2f0b9dac3ea0c0a
Author: Avi Kivity
Date:   Thu Mar 8 11:48:09 2007 +0200

    KVM: Fix bogus sign extension in mmu mapping audit

    When auditing a 32-bit guest on a 64-bit host, sign extension of the page table directory pointer table index caused bogus addresses to be shown on audit errors. Fix by declaring the index unsigned.

    Signed-off-by: Avi Kivity

commit fac539542cbf923a39238b10557c88f99fd45b59
Author: Avi Kivity
Date:   Wed Mar 7 09:29:48 2007 +0200

    KVM: Export <linux/kvm.h>

    This allows users to actually build programs that use kvm without the entire source tree.

    Signed-off-by: Avi Kivity

commit c14a46343cc9f04f15ebc67573031fe8bbe1555a
Author: Avi Kivity
Date:   Tue Mar 6 12:05:53 2007 +0200

    KVM: Fix guest sysenter on vmx

    The vmx code currently treats the guest's sysenter support msrs as 32-bit values, which breaks 32-bit compat mode userspace on 64-bit guests. Fix by using the native word width of the machine.

    Signed-off-by: Avi Kivity

commit ea135e7671189ffb7e67843bf98740dac0c6ccfa
Author: Avi Kivity
Date:   Sun Mar 4 13:27:36 2007 +0200

    KVM: Use own minor number

    Use the minor number (232) allocated to kvm by lanana.

    Signed-off-by: Avi Kivity

commit 21af17507f37658414191b1cf1337efbaf7dd530
Author: Dor Laor
Date:   Mon Feb 19 18:25:43 2007 +0200

    KVM: Use the generic skip_emulated_instruction() in hypercall code

    Instead of twiddling the rip registers directly, use the skip_emulated_instruction() function to do that for us.

    Signed-off-by: Dor Laor
    Signed-off-by: Avi Kivity
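The i386 PAE bug above in miniature: PAGE_MASK is an unsigned long (32 bits on i386), so masking a 64-bit physical address truncates bits 32-35; deriving the mask from PAGE_SIZE keeps the full width:

    u64 pa = some_physical_address;            /* may exceed 4GB */
    u64 bad  = pa & PAGE_MASK;                 /* high bits silently lost */
    u64 good = pa & ~((u64)PAGE_SIZE - 1);     /* 64-bit-safe page mask */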
commit 57d78025d84fb607aa335d015a79b257517aa209
Author: Dor Laor
Date:   Mon Feb 19 16:44:49 2007 +0200

    KVM: Fix guest register corruption on paravirt hypercall

    The hypercall code mixes up the ->cache_regs() and ->decache_regs() callbacks, resulting in guest register corruption.

    Signed-off-by: Dor Laor
    Signed-off-by: Avi Kivity

commit 28e9803c9134683a884efe05abdb3f814c1ca7e7
Author: Avi Kivity
Date:   Thu Mar 1 19:21:03 2007 +0200

    KVM: Unset kvm_arch_ops if arch module loading failed

    Otherwise, the core module thinks the arch module is loaded, and won't let you reload it after you've fixed the bug.

    Signed-off-by: Avi Kivity

commit 426bc2fd1462706ec92d0e9efdb0cf3643f4eb67
Author: Avi Kivity
Date:   Thu Mar 1 11:28:13 2007 +0200

    KVM: Move kvmfs magic number to <linux/magic.h>

    From: Andrew Morton

    Use the standard magic.h for kvmfs.

    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Avi Kivity

commit c1a8557e1da6e7d8bf8f77cb1b47c077f5c2a67d
Author: Avi Kivity
Date:   Mon Feb 26 16:29:43 2007 +0200

    KVM: Fix bogus failure in kvm.ko module initialization

    A bogus 'return r' can cause an otherwise successful module load to fail. This both denies users the use of kvm, and it also denies them the use of their machine, as it leaves a filesystem registered with its callbacks pointing into now-freed module memory. Fix by returning a zero like a good module. Thanks to Richard Lucassen (?) for reporting the problem and for providing access to a machine which exhibited it.

    Signed-off-by: Avi Kivity

commit 7703ff91ee2ed171f2175d030e7f063c4efab2f5
Author: Uri Lublin
Date:   Thu Feb 22 17:37:32 2007 +0200

    KVM: Remove write access permissions when dirty-page-logging is enabled

    Enabling dirty page logging is done using the KVM_SET_MEMORY_REGION ioctl. If the memory region already exists, we need to remove write accesses, so writes will be caught and dirty pages will be logged.

    Signed-off-by: Uri Lublin
    Signed-off-by: Avi Kivity

commit b77fd1f62576463434fc434cbdcd808847e169a1
Author: Uri Lublin
Date:   Thu Feb 22 17:15:33 2007 +0200

    kvm: move do_remove_write_access() up

    To be called from kvm_vm_ioctl_set_memory_region()

    Signed-off-by: Uri Lublin
    Signed-off-by: Avi Kivity

commit 62e287e7210d6ff142b3b05233fa1f5df686b794
Author: Uri Lublin
Date:   Thu Feb 22 16:43:09 2007 +0200

    KVM: Fix dirty page log bitmap size/access calculation

    Since dirty_bitmap is an unsigned long array, the alignment and size need to take that into account.

    Signed-off-by: Uri Lublin
    Signed-off-by: Avi Kivity

commit 871574eb14e959c19d94fdee7c3e2b88ae06770f
Author: Uri Lublin
Date:   Wed Feb 21 18:25:21 2007 +0200

    KVM: Add missing calls to mark_page_dirty()

    A few places where we modify guest memory fail to call mark_page_dirty(), causing live migration to fail. This adds the missing calls.

    Signed-off-by: Uri Lublin
    Signed-off-by: Avi Kivity

commit 42017e8bf8eb7b6f65b95bca1368ee274fc5ef50
Author: Uri Lublin
Date:   Thu Feb 22 17:37:32 2007 +0200

    kvm: dirty page logging: remove write access permissions when dirty-page-logging is enabled

    Enabling dirty page logging is done using the KVM_SET_MEMORY_REGION ioctl. If the memory region already exists, there is a need to remove write accesses, so writes will be caught, and dirty pages will be logged.
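The sizing rule from the dirty-log fix above: the bitmap is an array of unsigned long, so its byte size must round the page count up to whole longs (illustrative):

    unsigned long bitmap_bytes = ALIGN(npages, BITS_PER_LONG) / 8;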
commit a9fd29cfcb643b97cd76c7d836be4d0ed80f69e0 Author: Uri Lublin Date: Thu Feb 22 17:15:33 2007 +0200 kvm: move do_remove_write_access() up To be called from kvm_vm_ioctl_set_memory_region() commit fba4ba9c513ad2cd328f5f16980aa7b90d40cec0 Author: Uri Lublin Date: Thu Feb 22 16:43:09 2007 +0200 kvm: dirty pages log: fix bitmap size/access calculation Since dirty_bitmap is an unsigned long array (pointer) commit ae160d732685ab33d5a3a495663aa2b54c4d4734 Author: Uri Lublin Date: Thu Feb 22 15:47:42 2007 +0200 .gitignore: ignore emacs backup files (*~) commit 8267c1cd9a8a038e91c94e0cabc571a3614dc3e5 Author: Avi Kivity Date: Wed Feb 21 19:47:40 2007 +0200 KVM: Bump API version Signed-off-by: Avi Kivity commit c65237e78c19b8173338a49933c611dece13c1c6 Author: Avi Kivity Date: Wed Feb 21 18:04:26 2007 +0200 KVM: Per-vcpu inodes Allocate a distinct inode for every vcpu in a VM. This has the following benefits: - the filp cachelines are no longer bounced when f_count is incremented on every ioctl() - the API and internal code are distinctly clearer; for example, on the KVM_GET_REGS ioctl, there is no need to copy the vcpu number from userspace and then copy the registers back; the vcpu identity is derived from the fd used to make the call Right now the performance benefits are completely theoretical since (a) we don't support more than one vcpu per VM and (b) virtualization hardware inefficiencies completely overwhelm any cacheline bouncing effects. But both of these will change, and we need to prepare the API today. Signed-off-by: Avi Kivity commit 11c1297fadc533d1f66252088b4f4775018bafbb Author: Avi Kivity Date: Tue Feb 20 18:41:05 2007 +0200 KVM: Move kvm_vm_ioctl_create_vcpu() around In preparation of some hacking. Signed-off-by: Avi Kivity commit f3ad84386727171d8308338a2c5dee1deac2e50d Author: Avi Kivity Date: Tue Feb 20 18:27:58 2007 +0200 KVM: Rename some kvm_dev_ioctl_*() functions to kvm_vm_ioctl_*() This reflects the changed scope, from device-wide to single vm (previously every device open created a virtual machine). Signed-off-by: Avi Kivity commit 733e3f74f1c51bbc2e7a99df8b51767504b58de2 Author: Avi Kivity Date: Wed Feb 21 19:28:04 2007 +0200 KVM: Create an inode per virtual machine This avoids having filp->f_op and the corresponding inode->i_fop different, which is a little unorthodox. The ioctl list is split into two: global kvm ioctls and per-vm ioctls. A new ioctl, KVM_CREATE_VM, is used to create VMs and return the VM fd. Signed-off-by: Avi Kivity commit 52a96114380f8ab615626e4cec57b7015895bd0f Author: Avi Kivity Date: Tue Feb 20 14:07:37 2007 +0200 KVM: Add internal filesystem for generating inodes The kvmfs inodes will represent virtual machines and vcpus, as necessary, reducing cacheline bouncing due to inodes and filps being shared. Signed-off-by: Avi Kivity commit b00bc8b10197715f5b842f1f9a60e67a3484b10f Author: Uri Lublin Date: Wed Feb 21 18:25:21 2007 +0200 kvm, dirty pages log: adding some calls to mark_page_dirty() commit 58a214eba321d92f833221c26777e2119e34a19d Author: Avi Kivity Date: Mon Feb 19 14:37:48 2007 +0200 KVM: More 0 -> NULL conversions Signed-off-by: Avi Kivity commit f73199bb57b4c8feb7d8f60c6f1a25107de18dab Author: Joerg Roedel Date: Mon Feb 19 14:37:47 2007 +0200 KVM: SVM: intercept SMI to handle it at host level This patch changes the SVM code to intercept SMIs and handle them outside the guest.
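In SVM terms this is one extra bit in the vmcb control-area intercept vector, plus an exit handler that treats the resulting #VMEXIT as a no-op at host level. A sketch (the field and constant names follow this tree's svm.h; treat them as assumptions):

    static void enable_smi_intercept(struct vmcb *vmcb)
    {
            /* An SMI now causes a #VMEXIT (SVM_EXIT_SMI) instead of being
             * delivered while the guest is running. */
            vmcb->control.intercept |= 1ULL << INTERCEPT_SMI;
    }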
Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit fa2742c78f10fad8682e3af17df3e9fc2eece9e4 Author: Avi Kivity Date: Mon Feb 19 14:37:47 2007 +0200 KVM: svm: init cr0 with the wp bit set Signed-off-by: Avi Kivity commit 8da588a919dc0bef76e384d16fd13ea2189aa82d Author: Avi Kivity Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Wire up hypercall handlers to a central arch-independent location Signed-off-by: Avi Kivity commit 68f16784f188d280c75b39e2367ebc1adbc66d9d Author: Avi Kivity Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Add hypercall host support for svm Signed-off-by: Avi Kivity commit 7c8bd4d6fc0e2bfb35cd4c0e8ff39c4f8972d951 Author: Ingo Molnar Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Add host hypercall support for vmx Signed-off-by: Avi Kivity commit f846fa34a14ec37dc0194c6f47ea4374c140e6f1 Author: Ingo Molnar Date: Mon Feb 19 14:37:47 2007 +0200 KVM: add MSR based hypercall API This adds a special MSR based hypercall API to KVM. This is to be used by paravirtual kernels and virtual drivers. Signed-off-by: Ingo Molnar Signed-off-by: Avi Kivity commit 8aa04bb13cf90d68c26d6bea1e4c720f1f027be0 Author: Markus Rechberger Date: Mon Feb 19 14:37:47 2007 +0200 KVM: Use page_private()/set_page_private() apis Besides using an established api, this allows using kvm in older kernels. Signed-off-by: Markus Rechberger Signed-off-by: Avi Kivity commit 4d5a7e81cc63d28e94373cdeb74dc44045edaa10 Author: Ahmed S. Darwish Date: Mon Feb 19 14:37:46 2007 +0200 KVM: Use ARRAY_SIZE macro instead of manual calculation. Signed-off-by: Ahmed S. Darwish Signed-off-by: Dor Laor Signed-off-by: Avi Kivity commit 0fe9875fb3f9946a6c1cef6f1b9a286edc8ee2b9 Author: Markus Rechberger Date: Mon Feb 19 14:37:46 2007 +0200 KVM: vmx: hack set_cr0_no_modeswitch() to actually do modeswitch From: Joerg Roedel The whole thing is rotten, but this allows vmx to boot with the guest reboot fix. Signed-off-by: Markus Rechberger Signed-off-by: Joerg Roedel Signed-off-by: Avi Kivity commit 7e6e2bbad7f5dbccb389ee6d79be661972b18b15 Author: Avi Kivity Date: Mon Feb 19 14:37:46 2007 +0200 KVM: Cosmetics Signed-off-by: Avi Kivity commit cc66daca849ca8c2900ba8cc7640de664296d36a Author: Jeremy Katz Date: Mon Feb 19 14:37:46 2007 +0200 KVM: Move virtualization deactivation from CPU_DEAD state to CPU_DOWN_PREPARE This gives it more chances of surviving suspend. Signed-off-by: Jeremy Katz Signed-off-by: Avi Kivity commit 2959cd13ecc1fbe1b2339937481844ff963f1e7f Author: Avi Kivity Date: Mon Feb 19 14:37:46 2007 +0200 KVM: mmu: add missing dirty page tracking cases We fail to mark a page dirty in three cases: - setting the accessed bit in a pte - setting the dirty bit in a pte - emulating a write into a pagetable This fix adds the missing cases. 
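All three cases share one shape: the mmu writes into guest page-table memory, so the page holding the pte must itself be logged or a migration pass will miss the update. A hedged sketch (the helper is hypothetical; mark_page_dirty() and the pte masks are the ones used elsewhere in this series):

    static void set_guest_pte_bits_sketch(struct kvm_vcpu *vcpu,
                                          gfn_t pte_gfn, pt_element_t *ptep)
    {
            *ptep |= PT_ACCESSED_MASK | PT_DIRTY_MASK;  /* modifies guest memory */
            mark_page_dirty(vcpu->kvm, pte_gfn);        /* so log that page too */
    }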
Signed-off-by: Avi Kivity .gitignore | 3 drivers/kvm/Kconfig | 8 - drivers/kvm/kvm.h | 45 +++ drivers/kvm/kvm_main.c | 77 +++++- drivers/kvm/mmu.c | 276 +++++++++------------ drivers/kvm/paging_tmpl.h | 273 +++++++++++---------- drivers/kvm/svm.c | 12 + drivers/kvm/vmx.c | 599 ++++++++++++++++++++++++++++----------------- drivers/kvm/x86_emulate.c | 4 9 files changed, 764 insertions(+), 533 deletions(-) diff --git a/.gitignore b/.gitignore index 060a71d..343c716 100644 --- a/.gitignore +++ b/.gitignore @@ -45,3 +45,6 @@ series # cscope files cscope.* + +# emacs backup files +*~ diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig index e8e37d8..2f661e5 100644 --- a/drivers/kvm/Kconfig +++ b/drivers/kvm/Kconfig @@ -1,8 +1,12 @@ # # KVM configuration # -menu "Virtualization" +menuconfig VIRTUALIZATION + bool "Virtualization" depends on X86 + default y + +if VIRTUALIZATION config KVM tristate "Kernel-based Virtual Machine (KVM) support" @@ -35,4 +39,4 @@ config KVM_AMD Provides support for KVM on AMD processors equipped with the AMD-V (SVM) extensions. -endmenu +endif # VIRTUALIZATION diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index 1c040d8..e665f55 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -10,6 +10,8 @@ #include #include #include #include +#include +#include #include #include @@ -18,6 +20,7 @@ #include #include #define CR0_PE_MASK (1ULL << 0) +#define CR0_MP_MASK (1ULL << 1) #define CR0_TS_MASK (1ULL << 3) #define CR0_NE_MASK (1ULL << 5) #define CR0_WP_MASK (1ULL << 16) @@ -42,7 +45,8 @@ #define KVM_GUEST_CR0_MASK \ (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK \ | CR0_NW_MASK | CR0_CD_MASK) #define KVM_VM_CR0_ALWAYS_ON \ - (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK) + (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK | CR0_TS_MASK \ + | CR0_MP_MASK) #define KVM_GUEST_CR4_MASK \ (CR4_PSE_MASK | CR4_PAE_MASK | CR4_PGE_MASK | CR4_VMXE_MASK | CR4_VME_MASK) #define KVM_PMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK) @@ -51,10 +55,10 @@ #define KVM_RMODE_VM_CR4_ALWAYS_ON (CR4_ #define INVALID_PAGE (~(hpa_t)0) #define UNMAPPED_GVA (~(gpa_t)0) -#define KVM_MAX_VCPUS 1 +#define KVM_MAX_VCPUS 4 #define KVM_ALIAS_SLOTS 4 #define KVM_MEMORY_SLOTS 4 -#define KVM_NUM_MMU_PAGES 256 +#define KVM_NUM_MMU_PAGES 1024 #define KVM_MIN_FREE_MMU_PAGES 5 #define KVM_REFILL_PAGES 25 #define KVM_MAX_CPUID_ENTRIES 40 @@ -137,7 +141,7 @@ struct kvm_mmu_page { gfn_t gfn; union kvm_mmu_page_role role; - hpa_t page_hpa; + u64 *spt; unsigned long slot_bitmap; /* One bit set per slot which has memory * in this shadow page. 
*/ @@ -252,6 +256,8 @@ struct kvm_stat { u32 halt_exits; u32 request_irq_exits; u32 irq_exits; + u32 light_exits; + u32 efer_reload; }; struct kvm_vcpu { @@ -285,15 +291,20 @@ #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE u64 apic_base; u64 ia32_misc_enable_msr; int nmsrs; + int save_nmsrs; + int msr_offset_efer; +#ifdef CONFIG_X86_64 + int msr_offset_kernel_gs_base; +#endif struct vmx_msr_entry *guest_msrs; struct vmx_msr_entry *host_msrs; - struct list_head free_pages; - struct kvm_mmu_page page_header_buf[KVM_NUM_MMU_PAGES]; struct kvm_mmu mmu; struct kvm_mmu_memory_cache mmu_pte_chain_cache; struct kvm_mmu_memory_cache mmu_rmap_desc_cache; + struct kvm_mmu_memory_cache mmu_page_cache; + struct kvm_mmu_memory_cache mmu_page_header_cache; gfn_t last_pt_write_gfn; int last_pt_write_count; @@ -304,6 +315,12 @@ #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE char *host_fx_image; char *guest_fx_image; int fpu_active; + int guest_fpu_loaded; + struct vmx_host_state { + int loaded; + u16 fs_sel, gs_sel, ldt_sel; + int fs_gs_ldt_reload_needed; + } vmx_host_state; int mmio_needed; int mmio_read_completed; @@ -508,6 +525,8 @@ void fx_init(struct kvm_vcpu *vcpu); void load_msrs(struct vmx_msr_entry *e, int n); void save_msrs(struct vmx_msr_entry *e, int n); void kvm_resched(struct kvm_vcpu *vcpu); +void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); +void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); int kvm_read_guest(struct kvm_vcpu *vcpu, gva_t addr, @@ -521,10 +540,12 @@ int kvm_write_guest(struct kvm_vcpu *vcp unsigned long segment_base(u16 selector); -void kvm_mmu_pre_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes); -void kvm_mmu_post_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes); +void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, + const u8 *old, const u8 *new, int bytes); int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); void kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); +int kvm_mmu_load(struct kvm_vcpu *vcpu); +void kvm_mmu_unload(struct kvm_vcpu *vcpu); int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); @@ -536,6 +557,14 @@ static inline int kvm_mmu_page_fault(str return vcpu->mmu.page_fault(vcpu, gva, error_code); } +static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu) +{ + if (likely(vcpu->mmu.root_hpa != INVALID_PAGE)) + return 0; + + return kvm_mmu_load(vcpu); +} + static inline int is_long_mode(struct kvm_vcpu *vcpu) { #ifdef CONFIG_X86_64 diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index da985b3..230b25a 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -72,6 +72,8 @@ static struct kvm_stats_debugfs_item { { "halt_exits", STAT_OFFSET(halt_exits) }, { "request_irq", STAT_OFFSET(request_irq_exits) }, { "irq_exits", STAT_OFFSET(irq_exits) }, + { "light_exits", STAT_OFFSET(light_exits) }, + { "efer_reload", STAT_OFFSET(efer_reload) }, { NULL } }; @@ -253,6 +255,28 @@ int kvm_write_guest(struct kvm_vcpu *vcp } EXPORT_SYMBOL_GPL(kvm_write_guest); +void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) +{ + if (!vcpu->fpu_active || vcpu->guest_fpu_loaded) + return; + + vcpu->guest_fpu_loaded = 1; + fx_save(vcpu->host_fx_image); + fx_restore(vcpu->guest_fx_image); +} +EXPORT_SYMBOL_GPL(kvm_load_guest_fpu); + +void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) +{ + if (!vcpu->guest_fpu_loaded) + return; + + vcpu->guest_fpu_loaded = 0; + fx_save(vcpu->guest_fx_image); + fx_restore(vcpu->host_fx_image); +} +EXPORT_SYMBOL_GPL(kvm_put_guest_fpu); + /* * Switches to specified vcpu, until a matching vcpu_put() */ @@ -295,6 +319,9 @@ static 
struct kvm *kvm_create_vm(void) spin_lock_init(&kvm->lock); INIT_LIST_HEAD(&kvm->active_mmu_pages); + spin_lock(&kvm_lock); + list_add(&kvm->vm_list, &vm_list); + spin_unlock(&kvm_lock); for (i = 0; i < KVM_MAX_VCPUS; ++i) { struct kvm_vcpu *vcpu = &kvm->vcpus[i]; @@ -302,10 +329,6 @@ static struct kvm *kvm_create_vm(void) vcpu->cpu = -1; vcpu->kvm = kvm; vcpu->mmu.root_hpa = INVALID_PAGE; - INIT_LIST_HEAD(&vcpu->free_pages); - spin_lock(&kvm_lock); - list_add(&kvm->vm_list, &vm_list); - spin_unlock(&kvm_lock); } return kvm; } @@ -358,6 +381,16 @@ static void free_pio_guest_pages(struct } } +static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu) +{ + if (!vcpu->vmcs) + return; + + vcpu_load(vcpu); + kvm_mmu_unload(vcpu); + vcpu_put(vcpu); +} + static void kvm_free_vcpu(struct kvm_vcpu *vcpu) { if (!vcpu->vmcs) @@ -378,6 +411,11 @@ static void kvm_free_vcpus(struct kvm *k { unsigned int i; + /* + * Unpin any mmu pages first. + */ + for (i = 0; i < KVM_MAX_VCPUS; ++i) + kvm_unload_vcpu_mmu(&kvm->vcpus[i]); for (i = 0; i < KVM_MAX_VCPUS; ++i) kvm_free_vcpu(&kvm->vcpus[i]); } @@ -947,7 +985,7 @@ EXPORT_SYMBOL_GPL(gfn_to_page); void mark_page_dirty(struct kvm *kvm, gfn_t gfn) { int i; - struct kvm_memory_slot *memslot = NULL; + struct kvm_memory_slot *memslot; unsigned long rel_gfn; for (i = 0; i < kvm->nmemslots; ++i) { @@ -956,7 +994,7 @@ void mark_page_dirty(struct kvm *kvm, gf if (gfn >= memslot->base_gfn && gfn < memslot->base_gfn + memslot->npages) { - if (!memslot || !memslot->dirty_bitmap) + if (!memslot->dirty_bitmap) return; rel_gfn = gfn - memslot->base_gfn; @@ -1048,18 +1086,18 @@ static int emulator_write_phys(struct kv { struct page *page; void *virt; + unsigned offset = offset_in_page(gpa); if (((gpa + bytes - 1) >> PAGE_SHIFT) != (gpa >> PAGE_SHIFT)) return 0; page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT); if (!page) return 0; - kvm_mmu_pre_write(vcpu, gpa, bytes); mark_page_dirty(vcpu->kvm, gpa >> PAGE_SHIFT); virt = kmap_atomic(page, KM_USER0); + kvm_mmu_pte_write(vcpu, gpa, virt + offset, val, bytes); memcpy(virt + offset_in_page(gpa), val, bytes); kunmap_atomic(virt, KM_USER0); - kvm_mmu_post_write(vcpu, gpa, bytes); return 1; } @@ -1447,6 +1485,7 @@ int kvm_get_msr_common(struct kvm_vcpu * case MSR_IA32_MC0_MISC+16: case MSR_IA32_UCODE_REV: case MSR_IA32_PERF_STATUS: + case MSR_IA32_EBL_CR_POWERON: /* MTRR registers */ case 0xfe: case 0x200 ... 
0x2ff: @@ -2354,6 +2393,27 @@ out: return r; } +static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu) +{ + u64 efer; + int i; + struct kvm_cpuid_entry *e, *entry; + + rdmsrl(MSR_EFER, efer); + entry = NULL; + for (i = 0; i < vcpu->cpuid_nent; ++i) { + e = &vcpu->cpuid_entries[i]; + if (e->function == 0x80000001) { + entry = e; + break; + } + } + if (entry && (entry->edx & EFER_NX) && !(efer & EFER_NX)) { + entry->edx &= ~(1 << 20); + printk(KERN_INFO ": guest NX capability removed\n"); + } +} + static int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid *cpuid, struct kvm_cpuid_entry __user *entries) @@ -2368,6 +2428,7 @@ static int kvm_vcpu_ioctl_set_cpuid(stru cpuid->nent * sizeof(struct kvm_cpuid_entry))) goto out; vcpu->cpuid_nent = cpuid->nent; + cpuid_fix_nx_cap(vcpu); return 0; out: diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c index e8e2281..d4de988 100644 --- a/drivers/kvm/mmu.c +++ b/drivers/kvm/mmu.c @@ -22,6 +22,7 @@ #include #include #include #include +#include #include "vmx.h" #include "kvm.h" @@ -90,25 +91,11 @@ #define PT32_DIR_PSE36_SHIFT 13 #define PT32_DIR_PSE36_MASK (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT) -#define PT32_PTE_COPY_MASK \ - (PT_PRESENT_MASK | PT_ACCESSED_MASK | PT_DIRTY_MASK | PT_GLOBAL_MASK) - -#define PT64_PTE_COPY_MASK (PT64_NX_MASK | PT32_PTE_COPY_MASK) - #define PT_FIRST_AVAIL_BITS_SHIFT 9 #define PT64_SECOND_AVAIL_BITS_SHIFT 52 -#define PT_SHADOW_PS_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) #define PT_SHADOW_IO_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) -#define PT_SHADOW_WRITABLE_SHIFT (PT_FIRST_AVAIL_BITS_SHIFT + 1) -#define PT_SHADOW_WRITABLE_MASK (1ULL << PT_SHADOW_WRITABLE_SHIFT) - -#define PT_SHADOW_USER_SHIFT (PT_SHADOW_WRITABLE_SHIFT + 1) -#define PT_SHADOW_USER_MASK (1ULL << (PT_SHADOW_USER_SHIFT)) - -#define PT_SHADOW_BITS_OFFSET (PT_SHADOW_WRITABLE_SHIFT - PT_WRITABLE_SHIFT) - #define VALID_PAGE(x) ((x) != INVALID_PAGE) #define PT64_LEVEL_BITS 9 @@ -165,6 +152,8 @@ struct kvm_rmap_desc { static struct kmem_cache *pte_chain_cache; static struct kmem_cache *rmap_desc_cache; +static struct kmem_cache *mmu_page_cache; +static struct kmem_cache *mmu_page_header_cache; static int is_write_protection(struct kvm_vcpu *vcpu) { @@ -202,6 +191,15 @@ static int is_rmap_pte(u64 pte) == (PT_WRITABLE_MASK | PT_PRESENT_MASK); } +static void set_shadow_pte(u64 *sptep, u64 spte) +{ +#ifdef CONFIG_X86_64 + set_64bit((unsigned long *)sptep, spte); +#else + set_64bit((unsigned long long *)sptep, spte); +#endif +} + static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, struct kmem_cache *base_cache, int min, gfp_t gfp_flags) @@ -235,6 +233,14 @@ static int __mmu_topup_memory_caches(str goto out; r = mmu_topup_memory_cache(&vcpu->mmu_rmap_desc_cache, rmap_desc_cache, 1, gfp_flags); + if (r) + goto out; + r = mmu_topup_memory_cache(&vcpu->mmu_page_cache, + mmu_page_cache, 4, gfp_flags); + if (r) + goto out; + r = mmu_topup_memory_cache(&vcpu->mmu_page_header_cache, + mmu_page_header_cache, 4, gfp_flags); out: return r; } @@ -258,6 +264,8 @@ static void mmu_free_memory_caches(struc { mmu_free_memory_cache(&vcpu->mmu_pte_chain_cache); mmu_free_memory_cache(&vcpu->mmu_rmap_desc_cache); + mmu_free_memory_cache(&vcpu->mmu_page_cache); + mmu_free_memory_cache(&vcpu->mmu_page_header_cache); } static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc, @@ -434,18 +442,17 @@ static void rmap_write_protect(struct kv rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte); rmap_remove(vcpu, spte); 
kvm_arch_ops->tlb_flush(vcpu); - *spte &= ~(u64)PT_WRITABLE_MASK; + set_shadow_pte(spte, *spte & ~PT_WRITABLE_MASK); } } #ifdef MMU_DEBUG -static int is_empty_shadow_page(hpa_t page_hpa) +static int is_empty_shadow_page(u64 *spt) { u64 *pos; u64 *end; - for (pos = __va(page_hpa), end = pos + PAGE_SIZE / sizeof(u64); - pos != end; pos++) + for (pos = spt, end = pos + PAGE_SIZE / sizeof(u64); pos != end; pos++) if (*pos != 0) { printk(KERN_ERR "%s: %p %llx\n", __FUNCTION__, pos, *pos); @@ -455,13 +462,13 @@ static int is_empty_shadow_page(hpa_t pa } #endif -static void kvm_mmu_free_page(struct kvm_vcpu *vcpu, hpa_t page_hpa) +static void kvm_mmu_free_page(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *page_head) { - struct kvm_mmu_page *page_head = page_header(page_hpa); - - ASSERT(is_empty_shadow_page(page_hpa)); - page_head->page_hpa = page_hpa; - list_move(&page_head->link, &vcpu->free_pages); + ASSERT(is_empty_shadow_page(page_head->spt)); + list_del(&page_head->link); + mmu_memory_cache_free(&vcpu->mmu_page_cache, page_head->spt); + mmu_memory_cache_free(&vcpu->mmu_page_header_cache, page_head); ++vcpu->kvm->n_free_mmu_pages; } @@ -475,12 +482,15 @@ static struct kvm_mmu_page *kvm_mmu_allo { struct kvm_mmu_page *page; - if (list_empty(&vcpu->free_pages)) + if (!vcpu->kvm->n_free_mmu_pages) return NULL; - page = list_entry(vcpu->free_pages.next, struct kvm_mmu_page, link); - list_move(&page->link, &vcpu->kvm->active_mmu_pages); - ASSERT(is_empty_shadow_page(page->page_hpa)); + page = mmu_memory_cache_alloc(&vcpu->mmu_page_header_cache, + sizeof *page); + page->spt = mmu_memory_cache_alloc(&vcpu->mmu_page_cache, PAGE_SIZE); + set_page_private(virt_to_page(page->spt), (unsigned long)page); + list_add(&page->link, &vcpu->kvm->active_mmu_pages); + ASSERT(is_empty_shadow_page(page->spt)); page->slot_bitmap = 0; page->multimapped = 0; page->parent_pte = parent_pte; @@ -638,7 +648,7 @@ static void kvm_mmu_page_unlink_children u64 *pt; u64 ent; - pt = __va(page->page_hpa); + pt = page->spt; if (page->role.level == PT_PAGE_TABLE_LEVEL) { for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { @@ -685,12 +695,12 @@ static void kvm_mmu_zap_page(struct kvm_ } BUG_ON(!parent_pte); kvm_mmu_put_page(vcpu, page, parent_pte); - *parent_pte = 0; + set_shadow_pte(parent_pte, 0); } kvm_mmu_page_unlink_children(vcpu, page); if (!page->root_count) { hlist_del(&page->hash_link); - kvm_mmu_free_page(vcpu, page->page_hpa); + kvm_mmu_free_page(vcpu, page); } else list_move(&page->link, &vcpu->kvm->active_mmu_pages); } @@ -717,6 +727,17 @@ static int kvm_mmu_unprotect_page(struct return r; } +static void mmu_unshadow(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + struct kvm_mmu_page *page; + + while ((page = kvm_mmu_lookup_page(vcpu, gfn)) != NULL) { + pgprintk("%s: zap %lx %x\n", + __FUNCTION__, gfn, page->role.word); + kvm_mmu_zap_page(vcpu, page); + } +} + static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa) { int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT)); @@ -805,7 +826,7 @@ static int nonpaging_map(struct kvm_vcpu return -ENOMEM; } - table[index] = new_table->page_hpa | PT_PRESENT_MASK + table[index] = __pa(new_table->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK; } table_addr = table[index] & PT64_BASE_ADDR_MASK; @@ -817,11 +838,12 @@ static void mmu_free_roots(struct kvm_vc int i; struct kvm_mmu_page *page; + if (!VALID_PAGE(vcpu->mmu.root_hpa)) + return; #ifdef CONFIG_X86_64 if (vcpu->mmu.shadow_root_level == PT64_ROOT_LEVEL) { hpa_t root = vcpu->mmu.root_hpa; - 
ASSERT(VALID_PAGE(root)); page = page_header(root); --page->root_count; vcpu->mmu.root_hpa = INVALID_PAGE; @@ -832,7 +854,6 @@ #endif hpa_t root = vcpu->mmu.pae_root[i]; if (root) { - ASSERT(VALID_PAGE(root)); root &= PT64_BASE_ADDR_MASK; page = page_header(root); --page->root_count; @@ -857,7 +878,7 @@ #ifdef CONFIG_X86_64 ASSERT(!VALID_PAGE(root)); page = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL, 0, 0, NULL); - root = page->page_hpa; + root = __pa(page->spt); ++page->root_count; vcpu->mmu.root_hpa = root; return; @@ -878,7 +899,7 @@ #endif page = kvm_mmu_get_page(vcpu, root_gfn, i << 30, PT32_ROOT_LEVEL, !is_paging(vcpu), 0, NULL); - root = page->page_hpa; + root = __pa(page->spt); ++page->root_count; vcpu->mmu.pae_root[i] = root | PT_PRESENT_MASK; } @@ -928,9 +949,7 @@ static int nonpaging_init_context(struct context->free = nonpaging_free; context->root_level = 0; context->shadow_root_level = PT32E_ROOT_LEVEL; - mmu_alloc_roots(vcpu); - ASSERT(VALID_PAGE(context->root_hpa)); - kvm_arch_ops->set_cr3(vcpu, context->root_hpa); + context->root_hpa = INVALID_PAGE; return 0; } @@ -944,59 +963,6 @@ static void paging_new_cr3(struct kvm_vc { pgprintk("%s: cr3 %lx\n", __FUNCTION__, vcpu->cr3); mmu_free_roots(vcpu); - if (unlikely(vcpu->kvm->n_free_mmu_pages < KVM_MIN_FREE_MMU_PAGES)) - kvm_mmu_free_some_pages(vcpu); - mmu_alloc_roots(vcpu); - kvm_mmu_flush_tlb(vcpu); - kvm_arch_ops->set_cr3(vcpu, vcpu->mmu.root_hpa); -} - -static inline void set_pte_common(struct kvm_vcpu *vcpu, - u64 *shadow_pte, - gpa_t gaddr, - int dirty, - u64 access_bits, - gfn_t gfn) -{ - hpa_t paddr; - - *shadow_pte |= access_bits << PT_SHADOW_BITS_OFFSET; - if (!dirty) - access_bits &= ~PT_WRITABLE_MASK; - - paddr = gpa_to_hpa(vcpu, gaddr & PT64_BASE_ADDR_MASK); - - *shadow_pte |= access_bits; - - if (is_error_hpa(paddr)) { - *shadow_pte |= gaddr; - *shadow_pte |= PT_SHADOW_IO_MARK; - *shadow_pte &= ~PT_PRESENT_MASK; - return; - } - - *shadow_pte |= paddr; - - if (access_bits & PT_WRITABLE_MASK) { - struct kvm_mmu_page *shadow; - - shadow = kvm_mmu_lookup_page(vcpu, gfn); - if (shadow) { - pgprintk("%s: found shadow page for %lx, marking ro\n", - __FUNCTION__, gfn); - access_bits &= ~PT_WRITABLE_MASK; - if (is_writeble_pte(*shadow_pte)) { - *shadow_pte &= ~PT_WRITABLE_MASK; - kvm_arch_ops->tlb_flush(vcpu); - } - } - } - - if (access_bits & PT_WRITABLE_MASK) - mark_page_dirty(vcpu->kvm, gaddr >> PAGE_SHIFT); - - page_header_update_slot(vcpu->kvm, shadow_pte, gaddr); - rmap_add(vcpu, shadow_pte); } static void inject_page_fault(struct kvm_vcpu *vcpu, @@ -1006,23 +972,6 @@ static void inject_page_fault(struct kvm kvm_arch_ops->inject_page_fault(vcpu, addr, err_code); } -static inline int fix_read_pf(u64 *shadow_ent) -{ - if ((*shadow_ent & PT_SHADOW_USER_MASK) && - !(*shadow_ent & PT_USER_MASK)) { - /* - * If supervisor write protect is disabled, we shadow kernel - * pages as user pages so we can trap the write access. 
- */ - *shadow_ent |= PT_USER_MASK; - *shadow_ent &= ~PT_WRITABLE_MASK; - - return 1; - - } - return 0; -} - static void paging_free(struct kvm_vcpu *vcpu) { nonpaging_free(vcpu); @@ -1047,10 +996,7 @@ static int paging64_init_context_common( context->free = paging_free; context->root_level = level; context->shadow_root_level = level; - mmu_alloc_roots(vcpu); - ASSERT(VALID_PAGE(context->root_hpa)); - kvm_arch_ops->set_cr3(vcpu, context->root_hpa | - (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK))); + context->root_hpa = INVALID_PAGE; return 0; } @@ -1069,10 +1015,7 @@ static int paging32_init_context(struct context->free = paging_free; context->root_level = PT32_ROOT_LEVEL; context->shadow_root_level = PT32E_ROOT_LEVEL; - mmu_alloc_roots(vcpu); - ASSERT(VALID_PAGE(context->root_hpa)); - kvm_arch_ops->set_cr3(vcpu, context->root_hpa | - (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK))); + context->root_hpa = INVALID_PAGE; return 0; } @@ -1107,18 +1050,33 @@ static void destroy_kvm_mmu(struct kvm_v int kvm_mmu_reset_context(struct kvm_vcpu *vcpu) { + destroy_kvm_mmu(vcpu); + return init_kvm_mmu(vcpu); +} + +int kvm_mmu_load(struct kvm_vcpu *vcpu) +{ int r; - destroy_kvm_mmu(vcpu); - r = init_kvm_mmu(vcpu); - if (r < 0) - goto out; + spin_lock(&vcpu->kvm->lock); r = mmu_topup_memory_caches(vcpu); + if (r) + goto out; + mmu_alloc_roots(vcpu); + kvm_arch_ops->set_cr3(vcpu, vcpu->mmu.root_hpa); + kvm_mmu_flush_tlb(vcpu); out: + spin_unlock(&vcpu->kvm->lock); return r; } +EXPORT_SYMBOL_GPL(kvm_mmu_load); -static void mmu_pre_write_zap_pte(struct kvm_vcpu *vcpu, +void kvm_mmu_unload(struct kvm_vcpu *vcpu) +{ + mmu_free_roots(vcpu); +} + +static void mmu_pte_write_zap_pte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, u64 *spte) { @@ -1137,7 +1095,22 @@ static void mmu_pre_write_zap_pte(struct *spte = 0; } -void kvm_mmu_pre_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes) +static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *page, + u64 *spte, + const void *new, int bytes) +{ + if (page->role.level != PT_PAGE_TABLE_LEVEL) + return; + + if (page->role.glevels == PT32_ROOT_LEVEL) + paging32_update_pte(vcpu, page, spte, new, bytes); + else + paging64_update_pte(vcpu, page, spte, new, bytes); +} + +void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, + const u8 *old, const u8 *new, int bytes) { gfn_t gfn = gpa >> PAGE_SHIFT; struct kvm_mmu_page *page; @@ -1149,6 +1122,7 @@ void kvm_mmu_pre_write(struct kvm_vcpu * unsigned pte_size; unsigned page_offset; unsigned misaligned; + unsigned quadrant; int level; int flooded = 0; int npte; @@ -1169,6 +1143,7 @@ void kvm_mmu_pre_write(struct kvm_vcpu * continue; pte_size = page->role.glevels == PT32_ROOT_LEVEL ? 
4 : 8; misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1); + misaligned |= bytes < 4; if (misaligned || flooded) { /* * Misaligned accesses are too much trouble to fix @@ -1200,21 +1175,20 @@ void kvm_mmu_pre_write(struct kvm_vcpu * page_offset <<= 1; npte = 2; } + quadrant = page_offset >> PAGE_SHIFT; page_offset &= ~PAGE_MASK; + if (quadrant != page->role.quadrant) + continue; } - spte = __va(page->page_hpa); - spte += page_offset / sizeof(*spte); + spte = &page->spt[page_offset / sizeof(*spte)]; while (npte--) { - mmu_pre_write_zap_pte(vcpu, page, spte); + mmu_pte_write_zap_pte(vcpu, page, spte); + mmu_pte_write_new_pte(vcpu, page, spte, new, bytes); ++spte; } } } -void kvm_mmu_post_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes) -{ -} - int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) { gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, gva); @@ -1243,13 +1217,6 @@ static void free_mmu_pages(struct kvm_vc struct kvm_mmu_page, link); kvm_mmu_zap_page(vcpu, page); } - while (!list_empty(&vcpu->free_pages)) { - page = list_entry(vcpu->free_pages.next, - struct kvm_mmu_page, link); - list_del(&page->link); - __free_page(pfn_to_page(page->page_hpa >> PAGE_SHIFT)); - page->page_hpa = INVALID_PAGE; - } free_page((unsigned long)vcpu->mmu.pae_root); } @@ -1260,18 +1227,7 @@ static int alloc_mmu_pages(struct kvm_vc ASSERT(vcpu); - for (i = 0; i < KVM_NUM_MMU_PAGES; i++) { - struct kvm_mmu_page *page_header = &vcpu->page_header_buf[i]; - - INIT_LIST_HEAD(&page_header->link); - if ((page = alloc_page(GFP_KERNEL)) == NULL) - goto error_1; - set_page_private(page, (unsigned long)page_header); - page_header->page_hpa = (hpa_t)page_to_pfn(page) << PAGE_SHIFT; - memset(__va(page_header->page_hpa), 0, PAGE_SIZE); - list_add(&page_header->link, &vcpu->free_pages); - ++vcpu->kvm->n_free_mmu_pages; - } + vcpu->kvm->n_free_mmu_pages = KVM_NUM_MMU_PAGES; /* * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64. 
@@ -1296,7 +1252,6 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu { ASSERT(vcpu); ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa)); - ASSERT(list_empty(&vcpu->free_pages)); return alloc_mmu_pages(vcpu); } @@ -1305,7 +1260,6 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu) { ASSERT(vcpu); ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa)); - ASSERT(!list_empty(&vcpu->free_pages)); return init_kvm_mmu(vcpu); } @@ -1331,7 +1285,7 @@ void kvm_mmu_slot_remove_write_access(st if (!test_bit(slot, &page->slot_bitmap)) continue; - pt = __va(page->page_hpa); + pt = page->spt; for (i = 0; i < PT64_ENT_PER_PAGE; ++i) /* avoid RMW */ if (pt[i] & PT_WRITABLE_MASK) { @@ -1364,6 +1318,10 @@ void kvm_mmu_module_exit(void) kmem_cache_destroy(pte_chain_cache); if (rmap_desc_cache) kmem_cache_destroy(rmap_desc_cache); + if (mmu_page_cache) + kmem_cache_destroy(mmu_page_cache); + if (mmu_page_header_cache) + kmem_cache_destroy(mmu_page_header_cache); } int kvm_mmu_module_init(void) @@ -1379,6 +1337,18 @@ int kvm_mmu_module_init(void) if (!rmap_desc_cache) goto nomem; + mmu_page_cache = kmem_cache_create("kvm_mmu_page", + PAGE_SIZE, + PAGE_SIZE, 0, NULL, NULL); + if (!mmu_page_cache) + goto nomem; + + mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header", + sizeof(struct kvm_mmu_page), + 0, 0, NULL, NULL); + if (!mmu_page_header_cache) + goto nomem; + return 0; nomem: @@ -1482,7 +1452,7 @@ static int count_writable_mappings(struc int i; list_for_each_entry(page, &vcpu->kvm->active_mmu_pages, link) { - u64 *pt = __va(page->page_hpa); + u64 *pt = page->spt; if (page->role.level != PT_PAGE_TABLE_LEVEL) continue; diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h index 73ffbff..a7c5cb0 100644 --- a/drivers/kvm/paging_tmpl.h +++ b/drivers/kvm/paging_tmpl.h @@ -31,7 +31,6 @@ #if PTTYPE == 64 #define PT_INDEX(addr, level) PT64_INDEX(addr, level) #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) #define PT_LEVEL_MASK(level) PT64_LEVEL_MASK(level) - #define PT_PTE_COPY_MASK PT64_PTE_COPY_MASK #ifdef CONFIG_X86_64 #define PT_MAX_FULL_LEVELS 4 #else @@ -46,7 +45,6 @@ #elif PTTYPE == 32 #define PT_INDEX(addr, level) PT32_INDEX(addr, level) #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level) - #define PT_PTE_COPY_MASK PT32_PTE_COPY_MASK #define PT_MAX_FULL_LEVELS 2 #else #error Invalid PTTYPE value @@ -192,40 +190,143 @@ static void FNAME(mark_pagetable_dirty)( mark_page_dirty(kvm, walker->table_gfn[walker->level - 1]); } -static void FNAME(set_pte)(struct kvm_vcpu *vcpu, u64 guest_pte, - u64 *shadow_pte, u64 access_bits, gfn_t gfn) +static void FNAME(set_pte_common)(struct kvm_vcpu *vcpu, + u64 *shadow_pte, + gpa_t gaddr, + pt_element_t *gpte, + u64 access_bits, + int user_fault, + int write_fault, + int *ptwrite, + struct guest_walker *walker, + gfn_t gfn) { - ASSERT(*shadow_pte == 0); - access_bits &= guest_pte; - *shadow_pte = (guest_pte & PT_PTE_COPY_MASK); - set_pte_common(vcpu, shadow_pte, guest_pte & PT_BASE_ADDR_MASK, - guest_pte & PT_DIRTY_MASK, access_bits, gfn); + hpa_t paddr; + int dirty = *gpte & PT_DIRTY_MASK; + u64 spte = *shadow_pte; + int was_rmapped = is_rmap_pte(spte); + + pgprintk("%s: spte %llx gpte %llx access %llx write_fault %d" + " user_fault %d gfn %lx\n", + __FUNCTION__, spte, (u64)*gpte, access_bits, + write_fault, user_fault, gfn); + + if (write_fault && !dirty) { + *gpte |= PT_DIRTY_MASK; + dirty = 1; + FNAME(mark_pagetable_dirty)(vcpu->kvm, walker); + } + + spte |= PT_PRESENT_MASK | PT_ACCESSED_MASK | PT_DIRTY_MASK; + spte 
|= *gpte & PT64_NX_MASK; + if (!dirty) + access_bits &= ~PT_WRITABLE_MASK; + + paddr = gpa_to_hpa(vcpu, gaddr & PT64_BASE_ADDR_MASK); + + spte |= PT_PRESENT_MASK; + if (access_bits & PT_USER_MASK) + spte |= PT_USER_MASK; + + if (is_error_hpa(paddr)) { + spte |= gaddr; + spte |= PT_SHADOW_IO_MARK; + spte &= ~PT_PRESENT_MASK; + set_shadow_pte(shadow_pte, spte); + return; + } + + spte |= paddr; + + if ((access_bits & PT_WRITABLE_MASK) + || (write_fault && !is_write_protection(vcpu) && !user_fault)) { + struct kvm_mmu_page *shadow; + + spte |= PT_WRITABLE_MASK; + if (user_fault) { + mmu_unshadow(vcpu, gfn); + goto unshadowed; + } + + shadow = kvm_mmu_lookup_page(vcpu, gfn); + if (shadow) { + pgprintk("%s: found shadow page for %lx, marking ro\n", + __FUNCTION__, gfn); + access_bits &= ~PT_WRITABLE_MASK; + if (is_writeble_pte(spte)) { + spte &= ~PT_WRITABLE_MASK; + kvm_arch_ops->tlb_flush(vcpu); + } + if (write_fault) + *ptwrite = 1; + } + } + +unshadowed: + + if (access_bits & PT_WRITABLE_MASK) + mark_page_dirty(vcpu->kvm, gaddr >> PAGE_SHIFT); + + set_shadow_pte(shadow_pte, spte); + page_header_update_slot(vcpu->kvm, shadow_pte, gaddr); + if (!was_rmapped) + rmap_add(vcpu, shadow_pte); } -static void FNAME(set_pde)(struct kvm_vcpu *vcpu, u64 guest_pde, - u64 *shadow_pte, u64 access_bits, gfn_t gfn) +static void FNAME(set_pte)(struct kvm_vcpu *vcpu, pt_element_t *gpte, + u64 *shadow_pte, u64 access_bits, + int user_fault, int write_fault, int *ptwrite, + struct guest_walker *walker, gfn_t gfn) +{ + access_bits &= *gpte; + FNAME(set_pte_common)(vcpu, shadow_pte, *gpte & PT_BASE_ADDR_MASK, + gpte, access_bits, user_fault, write_fault, + ptwrite, walker, gfn); +} + +static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, + u64 *spte, const void *pte, int bytes) +{ + pt_element_t gpte; + + if (bytes < sizeof(pt_element_t)) + return; + gpte = *(const pt_element_t *)pte; + if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) + return; + pgprintk("%s: gpte %llx spte %p\n", __FUNCTION__, (u64)gpte, spte); + FNAME(set_pte)(vcpu, &gpte, spte, PT_USER_MASK | PT_WRITABLE_MASK, 0, + 0, NULL, NULL, + (gpte & PT_BASE_ADDR_MASK) >> PAGE_SHIFT); +} + +static void FNAME(set_pde)(struct kvm_vcpu *vcpu, pt_element_t *gpde, + u64 *shadow_pte, u64 access_bits, + int user_fault, int write_fault, int *ptwrite, + struct guest_walker *walker, gfn_t gfn) { gpa_t gaddr; - ASSERT(*shadow_pte == 0); - access_bits &= guest_pde; + access_bits &= *gpde; gaddr = (gpa_t)gfn << PAGE_SHIFT; if (PTTYPE == 32 && is_cpuid_PSE36()) - gaddr |= (guest_pde & PT32_DIR_PSE36_MASK) << + gaddr |= (*gpde & PT32_DIR_PSE36_MASK) << (32 - PT32_DIR_PSE36_SHIFT); - *shadow_pte = guest_pde & PT_PTE_COPY_MASK; - set_pte_common(vcpu, shadow_pte, gaddr, - guest_pde & PT_DIRTY_MASK, access_bits, gfn); + FNAME(set_pte_common)(vcpu, shadow_pte, gaddr, + gpde, access_bits, user_fault, write_fault, + ptwrite, walker, gfn); } /* * Fetch a shadow pte for a specific level in the paging hierarchy. 
*/ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, - struct guest_walker *walker) + struct guest_walker *walker, + int user_fault, int write_fault, int *ptwrite) { hpa_t shadow_addr; int level; + u64 *shadow_ent; u64 *prev_shadow_ent = NULL; pt_element_t *guest_ent = walker->ptep; @@ -242,37 +343,23 @@ static u64 *FNAME(fetch)(struct kvm_vcpu for (; ; level--) { u32 index = SHADOW_PT_INDEX(addr, level); - u64 *shadow_ent = ((u64 *)__va(shadow_addr)) + index; struct kvm_mmu_page *shadow_page; u64 shadow_pte; int metaphysical; gfn_t table_gfn; unsigned hugepage_access = 0; + shadow_ent = ((u64 *)__va(shadow_addr)) + index; if (is_present_pte(*shadow_ent) || is_io_pte(*shadow_ent)) { if (level == PT_PAGE_TABLE_LEVEL) - return shadow_ent; + break; shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK; prev_shadow_ent = shadow_ent; continue; } - if (level == PT_PAGE_TABLE_LEVEL) { - - if (walker->level == PT_DIRECTORY_LEVEL) { - if (prev_shadow_ent) - *prev_shadow_ent |= PT_SHADOW_PS_MARK; - FNAME(set_pde)(vcpu, *guest_ent, shadow_ent, - walker->inherited_ar, - walker->gfn); - } else { - ASSERT(walker->level == PT_PAGE_TABLE_LEVEL); - FNAME(set_pte)(vcpu, *guest_ent, shadow_ent, - walker->inherited_ar, - walker->gfn); - } - return shadow_ent; - } + if (level == PT_PAGE_TABLE_LEVEL) + break; if (level - 1 == PT_PAGE_TABLE_LEVEL && walker->level == PT_DIRECTORY_LEVEL) { @@ -289,90 +376,24 @@ static u64 *FNAME(fetch)(struct kvm_vcpu shadow_page = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1, metaphysical, hugepage_access, shadow_ent); - shadow_addr = shadow_page->page_hpa; + shadow_addr = __pa(shadow_page->spt); shadow_pte = shadow_addr | PT_PRESENT_MASK | PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK; *shadow_ent = shadow_pte; prev_shadow_ent = shadow_ent; } -} -/* - * The guest faulted for write. We need to - * - * - check write permissions - * - update the guest pte dirty bit - * - update our own dirty page tracking structures - */ -static int FNAME(fix_write_pf)(struct kvm_vcpu *vcpu, - u64 *shadow_ent, - struct guest_walker *walker, - gva_t addr, - int user, - int *write_pt) -{ - pt_element_t *guest_ent; - int writable_shadow; - gfn_t gfn; - struct kvm_mmu_page *page; - - if (is_writeble_pte(*shadow_ent)) - return !user || (*shadow_ent & PT_USER_MASK); - - writable_shadow = *shadow_ent & PT_SHADOW_WRITABLE_MASK; - if (user) { - /* - * User mode access. Fail if it's a kernel page or a read-only - * page. - */ - if (!(*shadow_ent & PT_SHADOW_USER_MASK) || !writable_shadow) - return 0; - ASSERT(*shadow_ent & PT_USER_MASK); - } else - /* - * Kernel mode access. Fail if it's a read-only page and - * supervisor write protection is enabled. - */ - if (!writable_shadow) { - if (is_write_protection(vcpu)) - return 0; - *shadow_ent &= ~PT_USER_MASK; - } - - guest_ent = walker->ptep; - - if (!is_present_pte(*guest_ent)) { - *shadow_ent = 0; - return 0; + if (walker->level == PT_DIRECTORY_LEVEL) { + FNAME(set_pde)(vcpu, guest_ent, shadow_ent, + walker->inherited_ar, user_fault, write_fault, + ptwrite, walker, walker->gfn); + } else { + ASSERT(walker->level == PT_PAGE_TABLE_LEVEL); + FNAME(set_pte)(vcpu, guest_ent, shadow_ent, + walker->inherited_ar, user_fault, write_fault, + ptwrite, walker, walker->gfn); } - - gfn = walker->gfn; - - if (user) { - /* - * Usermode page faults won't be for page table updates. 
- */ - while ((page = kvm_mmu_lookup_page(vcpu, gfn)) != NULL) { - pgprintk("%s: zap %lx %x\n", - __FUNCTION__, gfn, page->role.word); - kvm_mmu_zap_page(vcpu, page); - } - } else if (kvm_mmu_lookup_page(vcpu, gfn)) { - pgprintk("%s: found shadow page for %lx, marking ro\n", - __FUNCTION__, gfn); - mark_page_dirty(vcpu->kvm, gfn); - FNAME(mark_pagetable_dirty)(vcpu->kvm, walker); - *guest_ent |= PT_DIRTY_MASK; - *write_pt = 1; - return 0; - } - mark_page_dirty(vcpu->kvm, gfn); - *shadow_ent |= PT_WRITABLE_MASK; - FNAME(mark_pagetable_dirty)(vcpu->kvm, walker); - *guest_ent |= PT_DIRTY_MASK; - rmap_add(vcpu, shadow_ent); - - return 1; + return shadow_ent; } /* @@ -397,7 +418,6 @@ static int FNAME(page_fault)(struct kvm_ int fetch_fault = error_code & PFERR_FETCH_MASK; struct guest_walker walker; u64 *shadow_pte; - int fixed; int write_pt = 0; int r; @@ -421,27 +441,20 @@ static int FNAME(page_fault)(struct kvm_ pgprintk("%s: guest page fault\n", __FUNCTION__); inject_page_fault(vcpu, addr, walker.error_code); FNAME(release_walker)(&walker); + vcpu->last_pt_write_count = 0; /* reset fork detector */ return 0; } - shadow_pte = FNAME(fetch)(vcpu, addr, &walker); - pgprintk("%s: shadow pte %p %llx\n", __FUNCTION__, - shadow_pte, *shadow_pte); - - /* - * Update the shadow pte. - */ - if (write_fault) - fixed = FNAME(fix_write_pf)(vcpu, shadow_pte, &walker, addr, - user_fault, &write_pt); - else - fixed = fix_read_pf(shadow_pte); - - pgprintk("%s: updated shadow pte %p %llx\n", __FUNCTION__, - shadow_pte, *shadow_pte); + shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault, + &write_pt); + pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __FUNCTION__, + shadow_pte, *shadow_pte, write_pt); FNAME(release_walker)(&walker); + if (!write_pt) + vcpu->last_pt_write_count = 0; /* reset fork detector */ + /* * mmio: emulate if accessible, otherwise its a guest fault. 
*/ @@ -478,7 +491,5 @@ #undef PT_BASE_ADDR_MASK #undef PT_INDEX #undef SHADOW_PT_INDEX #undef PT_LEVEL_MASK -#undef PT_PTE_COPY_MASK -#undef PT_NON_PTE_COPY_MASK #undef PT_DIR_BASE_ADDR_MASK #undef PT_MAX_FULL_LEVELS diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index fa17d6d..ec040e2 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -378,7 +378,7 @@ static __init int svm_hardware_setup(voi int cpu; struct page *iopm_pages; struct page *msrpm_pages; - void *msrpm_va; + void *iopm_va, *msrpm_va; int r; kvm_emulator_want_group7_invlpg(); @@ -387,8 +387,10 @@ static __init int svm_hardware_setup(voi if (!iopm_pages) return -ENOMEM; - memset(page_address(iopm_pages), 0xff, - PAGE_SIZE * (1 << IOPM_ALLOC_ORDER)); + + iopm_va = page_address(iopm_pages); + memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER)); + clear_bit(0x80, iopm_va); /* allow direct access to PC debug port */ iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT; @@ -1481,6 +1483,10 @@ static int svm_vcpu_run(struct kvm_vcpu int r; again: + r = kvm_mmu_reload(vcpu); + if (unlikely(r)) + return r; + if (!vcpu->mmio_read_completed) do_interrupt_requests(vcpu, kvm_run); diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c index 184238e..a534e6f 100644 --- a/drivers/kvm/vmx.c +++ b/drivers/kvm/vmx.c @@ -34,11 +34,15 @@ MODULE_LICENSE("GPL"); static DEFINE_PER_CPU(struct vmcs *, vmxarea); static DEFINE_PER_CPU(struct vmcs *, current_vmcs); +static struct page *vmx_io_bitmap_a; +static struct page *vmx_io_bitmap_b; + #ifdef CONFIG_X86_64 #define HOST_IS_64 1 #else #define HOST_IS_64 0 #endif +#define EFER_SAVE_RESTORE_BITS ((u64)EFER_SCE) static struct vmcs_descriptor { int size; @@ -82,18 +86,17 @@ #endif }; #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index) -#ifdef CONFIG_X86_64 -static unsigned msr_offset_kernel_gs_base; -#define NR_64BIT_MSRS 4 -/* - * avoid save/load MSR_SYSCALL_MASK and MSR_LSTAR by std vt - * mechanism (cpu bug AA24) - */ -#define NR_BAD_MSRS 2 -#else -#define NR_64BIT_MSRS 0 -#define NR_BAD_MSRS 0 -#endif +static inline u64 msr_efer_save_restore_bits(struct vmx_msr_entry msr) +{ + return (u64)msr.data & EFER_SAVE_RESTORE_BITS; +} + +static inline int msr_efer_need_save_restore(struct kvm_vcpu *vcpu) +{ + int efer_offset = vcpu->msr_offset_efer; + return msr_efer_save_restore_bits(vcpu->host_msrs[efer_offset]) != + msr_efer_save_restore_bits(vcpu->guest_msrs[efer_offset]); +} static inline int is_page_fault(u32 intr_info) { @@ -115,13 +118,23 @@ static inline int is_external_interrupt( == (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); } -static struct vmx_msr_entry *find_msr_entry(struct kvm_vcpu *vcpu, u32 msr) +static int __find_msr_index(struct kvm_vcpu *vcpu, u32 msr) { int i; for (i = 0; i < vcpu->nmsrs; ++i) if (vcpu->guest_msrs[i].index == msr) - return &vcpu->guest_msrs[i]; + return i; + return -1; +} + +static struct vmx_msr_entry *find_msr_entry(struct kvm_vcpu *vcpu, u32 msr) +{ + int i; + + i = __find_msr_index(vcpu, msr); + if (i >= 0) + return &vcpu->guest_msrs[i]; return NULL; } @@ -234,6 +247,127 @@ static void vmcs_set_bits(unsigned long vmcs_writel(field, vmcs_readl(field) | mask); } +static void update_exception_bitmap(struct kvm_vcpu *vcpu) +{ + u32 eb; + + eb = 1u << PF_VECTOR; + if (!vcpu->fpu_active) + eb |= 1u << NM_VECTOR; + if (vcpu->guest_debug.enabled) + eb |= 1u << 1; + if (vcpu->rmode.active) + eb = ~0; + vmcs_write32(EXCEPTION_BITMAP, eb); +} + +static void reload_tss(void) +{ +#ifndef CONFIG_X86_64 + + /* + * VT restores TR but not its size. Useless. 
+ */ + struct descriptor_table gdt; + struct segment_descriptor *descs; + + get_gdt(&gdt); + descs = (void *)gdt.base; + descs[GDT_ENTRY_TSS].type = 9; /* available TSS */ + load_TR_desc(); +#endif +} + +static void load_transition_efer(struct kvm_vcpu *vcpu) +{ + u64 trans_efer; + int efer_offset = vcpu->msr_offset_efer; + + trans_efer = vcpu->host_msrs[efer_offset].data; + trans_efer &= ~EFER_SAVE_RESTORE_BITS; + trans_efer |= msr_efer_save_restore_bits( + vcpu->guest_msrs[efer_offset]); + wrmsrl(MSR_EFER, trans_efer); + vcpu->stat.efer_reload++; +} + +static void vmx_save_host_state(struct kvm_vcpu *vcpu) +{ + struct vmx_host_state *hs = &vcpu->vmx_host_state; + + if (hs->loaded) + return; + + hs->loaded = 1; + /* + * Set host fs and gs selectors. Unfortunately, 22.2.3 does not + * allow segment selectors with cpl > 0 or ti == 1. + */ + hs->ldt_sel = read_ldt(); + hs->fs_gs_ldt_reload_needed = hs->ldt_sel; + hs->fs_sel = read_fs(); + if (!(hs->fs_sel & 7)) + vmcs_write16(HOST_FS_SELECTOR, hs->fs_sel); + else { + vmcs_write16(HOST_FS_SELECTOR, 0); + hs->fs_gs_ldt_reload_needed = 1; + } + hs->gs_sel = read_gs(); + if (!(hs->gs_sel & 7)) + vmcs_write16(HOST_GS_SELECTOR, hs->gs_sel); + else { + vmcs_write16(HOST_GS_SELECTOR, 0); + hs->fs_gs_ldt_reload_needed = 1; + } + +#ifdef CONFIG_X86_64 + vmcs_writel(HOST_FS_BASE, read_msr(MSR_FS_BASE)); + vmcs_writel(HOST_GS_BASE, read_msr(MSR_GS_BASE)); +#else + vmcs_writel(HOST_FS_BASE, segment_base(hs->fs_sel)); + vmcs_writel(HOST_GS_BASE, segment_base(hs->gs_sel)); +#endif + +#ifdef CONFIG_X86_64 + if (is_long_mode(vcpu)) { + save_msrs(vcpu->host_msrs + vcpu->msr_offset_kernel_gs_base, 1); + } +#endif + load_msrs(vcpu->guest_msrs, vcpu->save_nmsrs); + if (msr_efer_need_save_restore(vcpu)) + load_transition_efer(vcpu); +} + +static void vmx_load_host_state(struct kvm_vcpu *vcpu) +{ + struct vmx_host_state *hs = &vcpu->vmx_host_state; + + if (!hs->loaded) + return; + + hs->loaded = 0; + if (hs->fs_gs_ldt_reload_needed) { + load_ldt(hs->ldt_sel); + load_fs(hs->fs_sel); + /* + * If we have to reload gs, we must take care to + * preserve our gs base. + */ + local_irq_disable(); + load_gs(hs->gs_sel); +#ifdef CONFIG_X86_64 + wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE)); +#endif + local_irq_enable(); + + reload_tss(); + } + save_msrs(vcpu->guest_msrs, vcpu->save_nmsrs); + load_msrs(vcpu->host_msrs, vcpu->save_nmsrs); + if (msr_efer_need_save_restore(vcpu)) + load_msrs(vcpu->host_msrs + vcpu->msr_offset_efer, 1); +} + /* * Switches to specified vcpu, until a matching vcpu_put(), but assumes * vcpu mutex is already taken. @@ -280,9 +414,31 @@ static void vmx_vcpu_load(struct kvm_vcp static void vmx_vcpu_put(struct kvm_vcpu *vcpu) { + vmx_load_host_state(vcpu); + kvm_put_guest_fpu(vcpu); put_cpu(); } +static void vmx_fpu_activate(struct kvm_vcpu *vcpu) +{ + if (vcpu->fpu_active) + return; + vcpu->fpu_active = 1; + vmcs_clear_bits(GUEST_CR0, CR0_TS_MASK); + if (vcpu->cr0 & CR0_TS_MASK) + vmcs_set_bits(GUEST_CR0, CR0_TS_MASK); + update_exception_bitmap(vcpu); +} + +static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu) +{ + if (!vcpu->fpu_active) + return; + vcpu->fpu_active = 0; + vmcs_set_bits(GUEST_CR0, CR0_TS_MASK); + update_exception_bitmap(vcpu); +} + static void vmx_vcpu_decache(struct kvm_vcpu *vcpu) { vcpu_clear(vcpu); @@ -331,41 +487,61 @@ static void vmx_inject_gp(struct kvm_vcp } /* + * Swap MSR entry in host/guest MSR entry array. 
+ */ +void move_msr_up(struct kvm_vcpu *vcpu, int from, int to) +{ + struct vmx_msr_entry tmp; + tmp = vcpu->guest_msrs[to]; + vcpu->guest_msrs[to] = vcpu->guest_msrs[from]; + vcpu->guest_msrs[from] = tmp; + tmp = vcpu->host_msrs[to]; + vcpu->host_msrs[to] = vcpu->host_msrs[from]; + vcpu->host_msrs[from] = tmp; +} + +/* * Set up the vmcs to automatically save and restore system * msrs. Don't touch the 64-bit msrs if the guest is in legacy * mode, as fiddling with msrs is very expensive. */ static void setup_msrs(struct kvm_vcpu *vcpu) { - int nr_skip, nr_good_msrs; - - if (is_long_mode(vcpu)) - nr_skip = NR_BAD_MSRS; - else - nr_skip = NR_64BIT_MSRS; - nr_good_msrs = vcpu->nmsrs - nr_skip; + int save_nmsrs; - /* - * MSR_K6_STAR is only needed on long mode guests, and only - * if efer.sce is enabled. - */ - if (find_msr_entry(vcpu, MSR_K6_STAR)) { - --nr_good_msrs; + save_nmsrs = 0; #ifdef CONFIG_X86_64 - if (is_long_mode(vcpu) && (vcpu->shadow_efer & EFER_SCE)) - ++nr_good_msrs; -#endif + if (is_long_mode(vcpu)) { + int index; + + index = __find_msr_index(vcpu, MSR_SYSCALL_MASK); + if (index >= 0) + move_msr_up(vcpu, index, save_nmsrs++); + index = __find_msr_index(vcpu, MSR_LSTAR); + if (index >= 0) + move_msr_up(vcpu, index, save_nmsrs++); + index = __find_msr_index(vcpu, MSR_CSTAR); + if (index >= 0) + move_msr_up(vcpu, index, save_nmsrs++); + index = __find_msr_index(vcpu, MSR_KERNEL_GS_BASE); + if (index >= 0) + move_msr_up(vcpu, index, save_nmsrs++); + /* + * MSR_K6_STAR is only needed on long mode guests, and only + * if efer.sce is enabled. + */ + index = __find_msr_index(vcpu, MSR_K6_STAR); + if ((index >= 0) && (vcpu->shadow_efer & EFER_SCE)) + move_msr_up(vcpu, index, save_nmsrs++); } +#endif + vcpu->save_nmsrs = save_nmsrs; - vmcs_writel(VM_ENTRY_MSR_LOAD_ADDR, - virt_to_phys(vcpu->guest_msrs + nr_skip)); - vmcs_writel(VM_EXIT_MSR_STORE_ADDR, - virt_to_phys(vcpu->guest_msrs + nr_skip)); - vmcs_writel(VM_EXIT_MSR_LOAD_ADDR, - virt_to_phys(vcpu->host_msrs + nr_skip)); - vmcs_write32(VM_EXIT_MSR_STORE_COUNT, nr_good_msrs); /* 22.2.2 */ - vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, nr_good_msrs); /* 22.2.2 */ - vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, nr_good_msrs); /* 22.2.2 */ +#ifdef CONFIG_X86_64 + vcpu->msr_offset_kernel_gs_base = + __find_msr_index(vcpu, MSR_KERNEL_GS_BASE); +#endif + vcpu->msr_offset_efer = __find_msr_index(vcpu, MSR_EFER); } /* @@ -393,23 +569,6 @@ static void guest_write_tsc(u64 guest_ts vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc); } -static void reload_tss(void) -{ -#ifndef CONFIG_X86_64 - - /* - * VT restores TR but not its size. Useless. - */ - struct descriptor_table gdt; - struct segment_descriptor *descs; - - get_gdt(&gdt); - descs = (void *)gdt.base; - descs[GDT_ENTRY_TSS].type = 9; /* available TSS */ - load_TR_desc(); -#endif -} - /* * Reads an msr value (of 'msr_index') into 'pdata'. * Returns 0 on success, non-0 otherwise. 
@@ -469,10 +628,15 @@ #endif static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data) { struct vmx_msr_entry *msr; + int ret = 0; + switch (msr_index) { #ifdef CONFIG_X86_64 case MSR_EFER: - return kvm_set_msr_common(vcpu, msr_index, data); + ret = kvm_set_msr_common(vcpu, msr_index, data); + if (vcpu->vmx_host_state.loaded) + load_transition_efer(vcpu); + break; case MSR_FS_BASE: vmcs_writel(GUEST_FS_BASE, data); break; @@ -496,14 +660,14 @@ #endif msr = find_msr_entry(vcpu, msr_index); if (msr) { msr->data = data; + if (vcpu->vmx_host_state.loaded) + load_msrs(vcpu->guest_msrs, vcpu->save_nmsrs); break; } - return kvm_set_msr_common(vcpu, msr_index, data); - msr->data = data; - break; + ret = kvm_set_msr_common(vcpu, msr_index, data); } - return 0; + return ret; } /* @@ -529,10 +693,8 @@ static void vcpu_put_rsp_rip(struct kvm_ static int set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_debug_guest *dbg) { unsigned long dr7 = 0x400; - u32 exception_bitmap; int old_singlestep; - exception_bitmap = vmcs_read32(EXCEPTION_BITMAP); old_singlestep = vcpu->guest_debug.singlestep; vcpu->guest_debug.enabled = dbg->enabled; @@ -548,13 +710,9 @@ static int set_guest_debug(struct kvm_vc dr7 |= 0 << (i*4+16); /* execution breakpoint */ } - exception_bitmap |= (1u << 1); /* Trap debug exceptions */ - vcpu->guest_debug.singlestep = dbg->singlestep; - } else { - exception_bitmap &= ~(1u << 1); /* Ignore debug exceptions */ + } else vcpu->guest_debug.singlestep = 0; - } if (old_singlestep && !vcpu->guest_debug.singlestep) { unsigned long flags; @@ -564,7 +722,7 @@ static int set_guest_debug(struct kvm_vc vmcs_writel(GUEST_RFLAGS, flags); } - vmcs_write32(EXCEPTION_BITMAP, exception_bitmap); + update_exception_bitmap(vcpu); vmcs_writel(GUEST_DR7, dr7); return 0; @@ -678,14 +836,6 @@ static __exit void hardware_unsetup(void free_kvm_area(); } -static void update_exception_bitmap(struct kvm_vcpu *vcpu) -{ - if (vcpu->rmode.active) - vmcs_write32(EXCEPTION_BITMAP, ~0); - else - vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR); -} - static void fix_pmode_dataseg(int seg, struct kvm_save_segment *save) { struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; @@ -836,6 +986,8 @@ static void vmx_decache_cr4_guest_bits(s static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) { + vmx_fpu_deactivate(vcpu); + if (vcpu->rmode.active && (cr0 & CR0_PE_MASK)) enter_pmode(vcpu); @@ -851,26 +1003,20 @@ #ifdef CONFIG_X86_64 } #endif - if (!(cr0 & CR0_TS_MASK)) { - vcpu->fpu_active = 1; - vmcs_clear_bits(EXCEPTION_BITMAP, CR0_TS_MASK); - } - vmcs_writel(CR0_READ_SHADOW, cr0); vmcs_writel(GUEST_CR0, (cr0 & ~KVM_GUEST_CR0_MASK) | KVM_VM_CR0_ALWAYS_ON); vcpu->cr0 = cr0; + + if (!(cr0 & CR0_TS_MASK) || !(cr0 & CR0_PE_MASK)) + vmx_fpu_activate(vcpu); } static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) { vmcs_writel(GUEST_CR3, cr3); - - if (!(vcpu->cr0 & CR0_TS_MASK)) { - vcpu->fpu_active = 0; - vmcs_set_bits(GUEST_CR0, CR0_TS_MASK); - vmcs_set_bits(EXCEPTION_BITMAP, 1 << NM_VECTOR); - } + if (vcpu->cr0 & CR0_PE_MASK) + vmx_fpu_deactivate(vcpu); } static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) @@ -936,23 +1082,11 @@ static void vmx_get_segment(struct kvm_v var->unusable = (ar >> 16) & 1; } -static void vmx_set_segment(struct kvm_vcpu *vcpu, - struct kvm_segment *var, int seg) +static u32 vmx_segment_access_rights(struct kvm_segment *var) { - struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; u32 ar; - vmcs_writel(sf->base, var->base); - 
vmcs_write32(sf->limit, var->limit); - vmcs_write16(sf->selector, var->selector); - if (vcpu->rmode.active && var->s) { - /* - * Hack real-mode segments into vm86 compatibility. - */ - if (var->base == 0xffff0000 && var->selector == 0xf000) - vmcs_writel(sf->base, 0xf0000); - ar = 0xf3; - } else if (var->unusable) + if (var->unusable) ar = 1 << 16; else { ar = var->type & 15; @@ -966,6 +1100,35 @@ static void vmx_set_segment(struct kvm_v } if (ar == 0) /* a 0 value means unusable */ ar = AR_UNUSABLE_MASK; + + return ar; +} + +static void vmx_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + u32 ar; + + if (vcpu->rmode.active && seg == VCPU_SREG_TR) { + vcpu->rmode.tr.selector = var->selector; + vcpu->rmode.tr.base = var->base; + vcpu->rmode.tr.limit = var->limit; + vcpu->rmode.tr.ar = vmx_segment_access_rights(var); + return; + } + vmcs_writel(sf->base, var->base); + vmcs_write32(sf->limit, var->limit); + vmcs_write16(sf->selector, var->selector); + if (vcpu->rmode.active && var->s) { + /* + * Hack real-mode segments into vm86 compatibility. + */ + if (var->base == 0xffff0000 && var->selector == 0xf000) + vmcs_writel(sf->base, 0xf0000); + ar = 0xf3; + } else + ar = vmx_segment_access_rights(var); vmcs_write32(sf->ar_bytes, ar); } @@ -1065,7 +1228,7 @@ static int vmx_vcpu_setup(struct kvm_vcp struct descriptor_table dt; int i; int ret = 0; - extern asmlinkage void kvm_vmx_return(void); + unsigned long kvm_vmx_return; if (!init_rmode_tss(vcpu->kvm)) { ret = -ENOMEM; @@ -1128,8 +1291,8 @@ static int vmx_vcpu_setup(struct kvm_vcp vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0); /* I/O */ - vmcs_write64(IO_BITMAP_A, 0); - vmcs_write64(IO_BITMAP_B, 0); + vmcs_write64(IO_BITMAP_A, page_to_phys(vmx_io_bitmap_a)); + vmcs_write64(IO_BITMAP_B, page_to_phys(vmx_io_bitmap_b)); guest_write_tsc(0); @@ -1149,12 +1312,11 @@ static int vmx_vcpu_setup(struct kvm_vcp CPU_BASED_HLT_EXITING /* 20.6.2 */ | CPU_BASED_CR8_LOAD_EXITING /* 20.6.2 */ | CPU_BASED_CR8_STORE_EXITING /* 20.6.2 */ - | CPU_BASED_UNCOND_IO_EXITING /* 20.6.2 */ + | CPU_BASED_ACTIVATE_IO_BITMAP /* 20.6.2 */ | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING /* 21.3 */ ); - vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR); vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0); vmcs_write32(CR3_TARGET_COUNT, 0); /* 22.2.1 */ @@ -1184,8 +1346,11 @@ #endif get_idt(&dt); vmcs_writel(HOST_IDTR_BASE, dt.base); /* 22.2.4 */ - - vmcs_writel(HOST_RIP, (unsigned long)kvm_vmx_return); /* 22.2.5 */ + asm ("mov $.Lkvm_vmx_return, %0" : "=r"(kvm_vmx_return)); + vmcs_writel(HOST_RIP, kvm_vmx_return); /* 22.2.5 */ + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0); + vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0); + vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0); rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk); vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs); @@ -1209,10 +1374,6 @@ #endif vcpu->host_msrs[j].reserved = 0; vcpu->host_msrs[j].data = data; vcpu->guest_msrs[j] = vcpu->host_msrs[j]; -#ifdef CONFIG_X86_64 - if (index == MSR_KERNEL_GS_BASE) - msr_offset_kernel_gs_base = j; -#endif ++vcpu->nmsrs; } @@ -1240,6 +1401,8 @@ #endif #ifdef CONFIG_X86_64 vmx_set_efer(vcpu, 0); #endif + vmx_fpu_activate(vcpu); + update_exception_bitmap(vcpu); return 0; @@ -1364,7 +1527,11 @@ static int handle_rmode_exception(struct if (!vcpu->rmode.active) return 0; - if (vec == GP_VECTOR && err_code == 0) + /* + * Instruction with address size override 
@@ -936,23 +1082,11 @@ static void vmx_get_segment(struct kvm_v
         var->unusable = (ar >> 16) & 1;
 }
 
-static void vmx_set_segment(struct kvm_vcpu *vcpu,
-                            struct kvm_segment *var, int seg)
+static u32 vmx_segment_access_rights(struct kvm_segment *var)
 {
-        struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
         u32 ar;
 
-        vmcs_writel(sf->base, var->base);
-        vmcs_write32(sf->limit, var->limit);
-        vmcs_write16(sf->selector, var->selector);
-        if (vcpu->rmode.active && var->s) {
-                /*
-                 * Hack real-mode segments into vm86 compatibility.
-                 */
-                if (var->base == 0xffff0000 && var->selector == 0xf000)
-                        vmcs_writel(sf->base, 0xf0000);
-                ar = 0xf3;
-        } else if (var->unusable)
+        if (var->unusable)
                 ar = 1 << 16;
         else {
                 ar = var->type & 15;
@@ -966,6 +1100,35 @@ static void vmx_set_segment(struct kvm_v
         }
         if (ar == 0) /* a 0 value means unusable */
                 ar = AR_UNUSABLE_MASK;
+
+        return ar;
+}
+
+static void vmx_set_segment(struct kvm_vcpu *vcpu,
+                            struct kvm_segment *var, int seg)
+{
+        struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
+        u32 ar;
+
+        if (vcpu->rmode.active && seg == VCPU_SREG_TR) {
+                vcpu->rmode.tr.selector = var->selector;
+                vcpu->rmode.tr.base = var->base;
+                vcpu->rmode.tr.limit = var->limit;
+                vcpu->rmode.tr.ar = vmx_segment_access_rights(var);
+                return;
+        }
+        vmcs_writel(sf->base, var->base);
+        vmcs_write32(sf->limit, var->limit);
+        vmcs_write16(sf->selector, var->selector);
+        if (vcpu->rmode.active && var->s) {
+                /*
+                 * Hack real-mode segments into vm86 compatibility.
+                 */
+                if (var->base == 0xffff0000 && var->selector == 0xf000)
+                        vmcs_writel(sf->base, 0xf0000);
+                ar = 0xf3;
+        } else
+                ar = vmx_segment_access_rights(var);
 
         vmcs_write32(sf->ar_bytes, ar);
 }
@@ -1065,7 +1228,7 @@ static int vmx_vcpu_setup(struct kvm_vcp
         struct descriptor_table dt;
         int i;
         int ret = 0;
-        extern asmlinkage void kvm_vmx_return(void);
+        unsigned long kvm_vmx_return;
 
         if (!init_rmode_tss(vcpu->kvm)) {
                 ret = -ENOMEM;
@@ -1128,8 +1291,8 @@ static int vmx_vcpu_setup(struct kvm_vcp
         vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
 
         /* I/O */
-        vmcs_write64(IO_BITMAP_A, 0);
-        vmcs_write64(IO_BITMAP_B, 0);
+        vmcs_write64(IO_BITMAP_A, page_to_phys(vmx_io_bitmap_a));
+        vmcs_write64(IO_BITMAP_B, page_to_phys(vmx_io_bitmap_b));
 
         guest_write_tsc(0);
@@ -1149,12 +1312,11 @@ static int vmx_vcpu_setup(struct kvm_vcp
                        CPU_BASED_HLT_EXITING         /* 20.6.2 */
                        | CPU_BASED_CR8_LOAD_EXITING    /* 20.6.2 */
                        | CPU_BASED_CR8_STORE_EXITING   /* 20.6.2 */
-                       | CPU_BASED_UNCOND_IO_EXITING   /* 20.6.2 */
+                       | CPU_BASED_ACTIVATE_IO_BITMAP  /* 20.6.2 */
                        | CPU_BASED_MOV_DR_EXITING
                        | CPU_BASED_USE_TSC_OFFSETING   /* 21.3 */
                 );
-        vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR);
         vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
         vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
         vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -1184,8 +1346,11 @@ #endif
         get_idt(&dt);
         vmcs_writel(HOST_IDTR_BASE, dt.base);   /* 22.2.4 */
-
-        vmcs_writel(HOST_RIP, (unsigned long)kvm_vmx_return); /* 22.2.5 */
+        asm ("mov $.Lkvm_vmx_return, %0" : "=r"(kvm_vmx_return));
+        vmcs_writel(HOST_RIP, kvm_vmx_return); /* 22.2.5 */
+        vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
+        vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
+        vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
 
         rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk);
         vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs);
@@ -1209,10 +1374,6 @@ #endif
                 vcpu->host_msrs[j].reserved = 0;
                 vcpu->host_msrs[j].data = data;
                 vcpu->guest_msrs[j] = vcpu->host_msrs[j];
-#ifdef CONFIG_X86_64
-                if (index == MSR_KERNEL_GS_BASE)
-                        msr_offset_kernel_gs_base = j;
-#endif
                 ++vcpu->nmsrs;
         }
@@ -1240,6 +1401,8 @@ #endif
 #ifdef CONFIG_X86_64
         vmx_set_efer(vcpu, 0);
 #endif
+        vmx_fpu_activate(vcpu);
+        update_exception_bitmap(vcpu);
 
         return 0;
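One subtlety in the vcpu_setup hunks: HOST_RIP is now materialized at runtime from the assembler-local label .Lkvm_vmx_return (matching the .L-prefixed labels introduced in the vmexit assembly below), presumably so that no global kvm_vmx_return symbol has to be exported from the inline asm. A minimal, standalone illustration of the technique, with a hypothetical label name:

/*
 * Illustrative only: capture the address of an assembler-local label in a
 * C variable.  .L-prefixed names never reach the object file's symbol
 * table; the label must still be unique within the translation unit.
 */
static inline unsigned long label_address(void)
{
        unsigned long addr;

        asm ("mov $.Lanchor, %0 \n\t"
             ".Lanchor:"
             : "=r"(addr));
        return addr;    /* address of the instruction after the mov */
}

The three VM_*_MSR_*_COUNT writes make a previously implicit assumption explicit: the hardware processes no MSR save/load lists on VM entry or exit.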
@@ -1364,7 +1527,11 @@ static int handle_rmode_exception(struct
         if (!vcpu->rmode.active)
                 return 0;
 
-        if (vec == GP_VECTOR && err_code == 0)
+        /*
+         * An instruction with the address-size override prefix (opcode 0x67)
+         * causes a #SS fault with error code 0 in VM86 mode.
+         */
+        if (((vec == GP_VECTOR) || (vec == SS_VECTOR)) && err_code == 0)
                 if (emulate_instruction(vcpu, NULL, 0, 0) == EMULATE_DONE)
                         return 1;
         return 0;
@@ -1399,10 +1566,7 @@ static int handle_exception(struct kvm_v
         }
 
         if (is_no_device(intr_info)) {
-                vcpu->fpu_active = 1;
-                vmcs_clear_bits(EXCEPTION_BITMAP, 1 << NM_VECTOR);
-                if (!(vcpu->cr0 & CR0_TS_MASK))
-                        vmcs_clear_bits(GUEST_CR0, CR0_TS_MASK);
+                vmx_fpu_activate(vcpu);
                 return 1;
         }
@@ -1594,11 +1758,10 @@ static int handle_cr(struct kvm_vcpu *vc
                 break;
         case 2: /* clts */
                 vcpu_load_rsp_rip(vcpu);
-                vcpu->fpu_active = 1;
-                vmcs_clear_bits(EXCEPTION_BITMAP, 1 << NM_VECTOR);
-                vmcs_clear_bits(GUEST_CR0, CR0_TS_MASK);
+                vmx_fpu_deactivate(vcpu);
                 vcpu->cr0 &= ~CR0_TS_MASK;
                 vmcs_writel(CR0_READ_SHADOW, vcpu->cr0);
+                vmx_fpu_activate(vcpu);
                 skip_emulated_instruction(vcpu);
                 return 1;
         case 1: /*mov from cr*/
@@ -1769,7 +1932,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 };
 
 static const int kvm_vmx_max_exit_handlers =
-        sizeof(kvm_vmx_exit_handlers) / sizeof(*kvm_vmx_exit_handlers);
+        ARRAY_SIZE(kvm_vmx_exit_handlers);
 
 /*
  * The guest has exited.  See if we can fix it or if we need userspace
@@ -1812,57 +1975,28 @@ static int dm_request_for_irq_injection(
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
         u8 fail;
-        u16 fs_sel, gs_sel, ldt_sel;
-        int fs_gs_ldt_reload_needed;
         int r;
 
-again:
-        /*
-         * Set host fs and gs selectors.  Unfortunately, 22.2.3 does not
-         * allow segment selectors with cpl > 0 or ti == 1.
-         */
-        fs_sel = read_fs();
-        gs_sel = read_gs();
-        ldt_sel = read_ldt();
-        fs_gs_ldt_reload_needed = (fs_sel & 7) | (gs_sel & 7) | ldt_sel;
-        if (!fs_gs_ldt_reload_needed) {
-                vmcs_write16(HOST_FS_SELECTOR, fs_sel);
-                vmcs_write16(HOST_GS_SELECTOR, gs_sel);
-        } else {
-                vmcs_write16(HOST_FS_SELECTOR, 0);
-                vmcs_write16(HOST_GS_SELECTOR, 0);
-        }
-
-#ifdef CONFIG_X86_64
-        vmcs_writel(HOST_FS_BASE, read_msr(MSR_FS_BASE));
-        vmcs_writel(HOST_GS_BASE, read_msr(MSR_GS_BASE));
-#else
-        vmcs_writel(HOST_FS_BASE, segment_base(fs_sel));
-        vmcs_writel(HOST_GS_BASE, segment_base(gs_sel));
-#endif
-
+preempted:
         if (!vcpu->mmio_read_completed)
                 do_interrupt_requests(vcpu, kvm_run);
 
         if (vcpu->guest_debug.enabled)
                 kvm_guest_debug_pre(vcpu);
 
-        if (vcpu->fpu_active) {
-                fx_save(vcpu->host_fx_image);
-                fx_restore(vcpu->guest_fx_image);
-        }
+again:
+        vmx_save_host_state(vcpu);
+        kvm_load_guest_fpu(vcpu);
+
+        r = kvm_mmu_reload(vcpu);
+        if (unlikely(r))
+                goto out;
+
         /*
          * Loading guest fpu may have cleared host cr0.ts
          */
         vmcs_writel(HOST_CR0, read_cr0());
 
-#ifdef CONFIG_X86_64
-        if (is_long_mode(vcpu)) {
-                save_msrs(vcpu->host_msrs + msr_offset_kernel_gs_base, 1);
-                load_msrs(vcpu->guest_msrs, NR_BAD_MSRS);
-        }
-#endif
-
         asm (
                 /* Store host registers */
                 "pushf \n\t"
@@ -1910,12 +2044,11 @@ #else
                 "mov %c[rcx](%3), %%ecx \n\t" /* kills %3 (ecx) */
 #endif
                 /* Enter guest mode */
-                "jne launched \n\t"
+                "jne .Llaunched \n\t"
                 ASM_VMX_VMLAUNCH "\n\t"
-                "jmp kvm_vmx_return \n\t"
-                "launched: " ASM_VMX_VMRESUME "\n\t"
-                ".globl kvm_vmx_return \n\t"
-                "kvm_vmx_return: "
+                "jmp .Lkvm_vmx_return \n\t"
+                ".Llaunched: " ASM_VMX_VMRESUME "\n\t"
+                ".Lkvm_vmx_return: "
                 /* Save guest registers, load host registers, keep flags */
 #ifdef CONFIG_X86_64
                 "xchg %3, (%%rsp) \n\t"
@@ -1982,80 +2115,54 @@ #endif
               [cr2]"i"(offsetof(struct kvm_vcpu, cr2))
             : "cc", "memory" );
 
-        /*
-         * Reload segment selectors ASAP. (it's needed for a functional
-         * kernel: x86 relies on having __KERNEL_PDA in %fs and x86_64
-         * relies on having 0 in %gs for the CPU PDA to work.)
-         */
-        if (fs_gs_ldt_reload_needed) {
-                load_ldt(ldt_sel);
-                load_fs(fs_sel);
-                /*
-                 * If we have to reload gs, we must take care to
-                 * preserve our gs base.
-                 */
-                local_irq_disable();
-                load_gs(gs_sel);
-#ifdef CONFIG_X86_64
-                wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE));
-#endif
-                local_irq_enable();
-
-                reload_tss();
-        }
         ++vcpu->stat.exits;
 
-#ifdef CONFIG_X86_64
-        if (is_long_mode(vcpu)) {
-                save_msrs(vcpu->guest_msrs, NR_BAD_MSRS);
-                load_msrs(vcpu->host_msrs, NR_BAD_MSRS);
-        }
-#endif
-
-        if (vcpu->fpu_active) {
-                fx_save(vcpu->guest_fx_image);
-                fx_restore(vcpu->host_fx_image);
-        }
-
         vcpu->interrupt_window_open =
                 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0;
 
         asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 
-        if (fail) {
+        if (unlikely(fail)) {
                 kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
                 kvm_run->fail_entry.hardware_entry_failure_reason
                         = vmcs_read32(VM_INSTRUCTION_ERROR);
                 r = 0;
-        } else {
-                /*
-                 * Profile KVM exit RIPs:
-                 */
-                if (unlikely(prof_on == KVM_PROFILING))
-                        profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
-
-                vcpu->launched = 1;
-                r = kvm_handle_exit(kvm_run, vcpu);
-                if (r > 0) {
-                        /* Give scheduler a change to reschedule. */
-                        if (signal_pending(current)) {
-                                ++vcpu->stat.signal_exits;
-                                post_kvm_run_save(vcpu, kvm_run);
-                                kvm_run->exit_reason = KVM_EXIT_INTR;
-                                return -EINTR;
-                        }
-
-                        if (dm_request_for_irq_injection(vcpu, kvm_run)) {
-                                ++vcpu->stat.request_irq_exits;
-                                post_kvm_run_save(vcpu, kvm_run);
-                                kvm_run->exit_reason = KVM_EXIT_INTR;
-                                return -EINTR;
-                        }
-
-                        kvm_resched(vcpu);
+                goto out;
+        }
+        /*
+         * Profile KVM exit RIPs:
+         */
+        if (unlikely(prof_on == KVM_PROFILING))
+                profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
+
+        vcpu->launched = 1;
+        r = kvm_handle_exit(kvm_run, vcpu);
+        if (r > 0) {
+                /* Give the scheduler a chance to reschedule. */
+                if (signal_pending(current)) {
+                        r = -EINTR;
+                        kvm_run->exit_reason = KVM_EXIT_INTR;
+                        ++vcpu->stat.signal_exits;
+                        goto out;
+                }
+
+                if (dm_request_for_irq_injection(vcpu, kvm_run)) {
+                        r = -EINTR;
+                        kvm_run->exit_reason = KVM_EXIT_INTR;
+                        ++vcpu->stat.request_irq_exits;
+                        goto out;
+                }
+                if (!need_resched()) {
+                        ++vcpu->stat.light_exits;
                         goto again;
                 }
         }
+out:
+        if (r > 0) {
+                kvm_resched(vcpu);
+                goto preempted;
+        }
+
         post_kvm_run_save(vcpu, kvm_run);
         return r;
 }
@@ -2128,7 +2235,6 @@ static int vmx_create_vcpu(struct kvm_vc
         vmcs_clear(vmcs);
         vcpu->vmcs = vmcs;
         vcpu->launched = 0;
-        vcpu->fpu_active = 1;
 
         return 0;
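The rewritten vmx_vcpu_run() is easiest to read as a three-label state machine: light exits loop straight back to again:, a needed reschedule drops through out: and restarts at preempted: (so pending interrupt injection is redone), and signals or userspace requests leave the function. A compilable schematic with stand-in stubs follows; all names and return values here are illustrative, not KVM's:

/* Stand-ins for the real KVM calls; behavior is illustrative only. */
static int handle_exit_stub(void)    { static int n; return n++ < 3; }
static int signal_pending_stub(void) { return 0; }
static int need_resched_stub(void)   { static int n; return ++n % 2 == 0; }

static int vcpu_run_schematic(void)
{
        int r;

preempted:
        /* inject pending interrupts, guest-debug setup ... */
again:
        /* save host state, load guest FPU, reload MMU, enter guest ... */
        r = handle_exit_stub();
        if (r > 0) {
                if (signal_pending_stub()) {
                        r = -1;         /* -EINTR in the real code */
                        goto out;
                }
                if (!need_resched_stub())
                        goto again;     /* light exit: reenter directly */
        }
out:
        if (r > 0) {
                /* kvm_resched(): yield the CPU, then redo injection */
                goto preempted;
        }
        return r;
}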
@@ -2194,11 +2300,50 @@ #endif
 
 static int __init vmx_init(void)
 {
-        return kvm_init_arch(&vmx_arch_ops, THIS_MODULE);
+        void *iova;
+        int r;
+
+        vmx_io_bitmap_a = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
+        if (!vmx_io_bitmap_a)
+                return -ENOMEM;
+
+        vmx_io_bitmap_b = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
+        if (!vmx_io_bitmap_b) {
+                r = -ENOMEM;
+                goto out;
+        }
+
+        /*
+         * Allow direct access to the PC debug port (it is often used for I/O
+         * delays, but the vmexits simply slow things down).
+         */
+        iova = kmap(vmx_io_bitmap_a);
+        memset(iova, 0xff, PAGE_SIZE);
+        clear_bit(0x80, iova);
+        kunmap(vmx_io_bitmap_a);
+
+        iova = kmap(vmx_io_bitmap_b);
+        memset(iova, 0xff, PAGE_SIZE);
+        kunmap(vmx_io_bitmap_b);
+
+        r = kvm_init_arch(&vmx_arch_ops, THIS_MODULE);
+        if (r)
+                goto out1;
+
+        return 0;
+
+out1:
+        __free_page(vmx_io_bitmap_b);
+out:
+        __free_page(vmx_io_bitmap_a);
+        return r;
 }
 
 static void __exit vmx_exit(void)
 {
+        __free_page(vmx_io_bitmap_b);
+        __free_page(vmx_io_bitmap_a);
+
         kvm_exit_arch();
 }

diff --git a/drivers/kvm/x86_emulate.c b/drivers/kvm/x86_emulate.c
index 7ade090..6123c02 100644
--- a/drivers/kvm/x86_emulate.c
+++ b/drivers/kvm/x86_emulate.c
@@ -152,7 +152,7 @@ static u8 opcode_table[256] = {
 static u16 twobyte_table[256] = {
         /* 0x00 - 0x0F */
         0, SrcMem | ModRM | DstReg, 0, 0, 0, 0, ImplicitOps, 0,
-        0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0,
+        0, ImplicitOps, 0, 0, 0, ImplicitOps | ModRM, 0, 0,
         /* 0x10 - 0x1F */
         0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0,
         /* 0x20 - 0x2F */
@@ -1304,6 +1304,8 @@ twobyte_special_insn:
         /* Disable writeback. */
         dst.orig_val = dst.val;
         switch (b) {
+        case 0x09: /* wbinvd */
+                break;
         case 0x0d: /* GrpP (prefetch) */
         case 0x18: /* Grp16 (prefetch/nop) */
                 break;
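On the vmx_init() hunk above: in VMX, each bit of the two I/O bitmaps governs exactly one port (bitmap A covers ports 0x0000-0x7fff, bitmap B covers 0x8000-0xffff), and a set bit forces a vmexit on any access to that port once CPU_BASED_ACTIVATE_IO_BITMAP is enabled. Filling both pages with 0xff and clearing only bit 0x80 therefore means "trap every port except the PC debug port". A hypothetical helper, not in the patch, that makes the port-to-bit mapping explicit:

/*
 * Hypothetical helper, not part of the patch: let the guest access one
 * port directly by clearing its bit.  Bitmap A covers ports 0x0000-0x7fff,
 * bitmap B covers 0x8000-0xffff; a set bit means "vmexit on access".
 */
static void vmx_allow_direct_io(void *bitmap_a, void *bitmap_b, u16 port)
{
        if (port < 0x8000)
                clear_bit(port, bitmap_a);
        else
                clear_bit(port - 0x8000, bitmap_b);
}

With such a helper, the initialization above would read as: memset both bitmap pages to 0xff, then pass port 0x80 through. Each 4096-byte page holds exactly the 0x8000 bits its half of the port space needs.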