Index: linux/Documentation/hrtimer/highres.txt =================================================================== --- /dev/null +++ linux/Documentation/hrtimer/highres.txt @@ -0,0 +1,249 @@ +High resolution timers and dynamic ticks design notes +----------------------------------------------------- + +Further information can be found in the paper of the OLS 2006 talk "hrtimers +and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can +be found on the OLS website: +http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf + +The slides to this talk are available from: +http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf + +The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the +changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the +design of the Linux time(r) system before hrtimers and other building blocks +got merged into mainline. + +Note: the paper and the slides are talking about "clock event source", while we +switched to the name "clock event devices" in meantime. + +The design contains the following basic building blocks: + +- hrtimer base infrastructure +- timeofday and clock source management +- clock event management +- high resolution timer functionality +- dynamic ticks + + +hrtimer base infrastructure +--------------------------- + +The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of +the base implementation are covered in Documentation/hrtimer/hrtimer.txt. See +also figure #2 (OLS slides p. 15) + +The main differences to the timer wheel, which holds the armed timer_list type +timers are: + - time ordered enqueueing into a rb-tree + - independent of ticks (the processing is based on nanoseconds) + + +timeofday and clock source management +------------------------------------- + +John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of +code out of the architecture-specific areas into a generic management +framework, as illustrated in figure #3 (OLS slides p. 18). The architecture +specific portion is reduced to the low level hardware details of the clock +sources, which are registered in the framework and selected on a quality based +decision. The low level code provides hardware setup and readout routines and +initializes data structures, which are used by the generic time keeping code to +convert the clock ticks to nanosecond based time values. All other time keeping +related functionality is moved into the generic code. The GTOD base patch got +merged into the 2.6.18 kernel. + +Further information about the Generic Time Of Day framework is available in the +OLS 2005 Proceedings Volume 1: +http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf + +The paper "We Are Not Getting Any Younger: A New Approach to Time and +Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan. + +Figure #3 (OLS slides p.18) illustrates the transformation. + + +clock event management +---------------------- + +While clock sources provide read access to the monotonically increasing time +value, clock event devices are used to schedule the next event +interrupt(s). The next event is currently defined to be periodic, with its +period defined at compile time. The setup and selection of the event device +for various event driven functionalities is hardwired into the architecture +dependent code. This results in duplicated code across all architectures and +makes it extremely difficult to change the configuration of the system to use +event interrupt devices other than those already built into the +architecture. Another implication of the current design is that it is necessary +to touch all the architecture-specific implementations in order to provide new +functionality like high resolution timers or dynamic ticks. + +The clock events subsystem tries to address this problem by providing a generic +solution to manage clock event devices and their usage for the various clock +event driven kernel functionalities. The goal of the clock event subsystem is +to minimize the clock event related architecture dependent code to the pure +hardware related handling and to allow easy addition and utilization of new +clock event devices. It also minimizes the duplicated code across the +architectures as it provides generic functionality down to the interrupt +service handler, which is almost inherently hardware dependent. + +Clock event devices are registered either by the architecture dependent boot +code or at module insertion time. Each clock event device fills a data +structure with clock-specific property parameters and callback functions. The +clock event management decides, by using the specified property parameters, the +set of system functions a clock event device will be used to support. This +includes the distinction of per-CPU and per-system global event devices. + +System-level global event devices are used for the Linux periodic tick. Per-CPU +event devices are used to provide local CPU functionality such as process +accounting, profiling, and high resolution timers. + +The management layer assignes one or more of the folliwing functions to a clock +event device: + - system global periodic tick (jiffies update) + - cpu local update_process_times + - cpu local profiling + - cpu local next event interrupt (non periodic mode) + +The clock event device delegates the selection of those timer interrupt related +functions completely to the management layer. The clock management layer stores +a function pointer in the device description structure, which has to be called +from the hardware level handler. This removes a lot of duplicated code from the +architecture specific timer interrupt handlers and hands the control over the +clock event devices and the assignment of timer interrupt related functionality +to the core code. + +The clock event layer API is rather small. Aside from the clock event device +registration interface it provides functions to schedule the next event +interrupt, clock event device notification service and support for suspend and +resume. + +The framework adds about 700 lines of code which results in a 2KB increase of +the kernel binary size. The conversion of i386 removes about 100 lines of +code. The binary size decrease is in the range of 400 byte. We believe that the +increase of flexibility and the avoidance of duplicated code across +architectures justifies the slight increase of the binary size. + +The conversion of an architecture has no functional impact, but allows to +utilize the high resolution and dynamic tick functionalites without any change +to the clock event device and timer interrupt code. After the conversion the +enabling of high resolution timers and dynamic ticks is simply provided by +adding the kernel/time/Kconfig file to the architecture specific Kconfig and +adding the dynamic tick specific calls to the idle routine (a total of 3 lines +added to the idle function and the Kconfig file) + +Figure #4 (OLS slides p.20) illustrates the transformation. + + +high resolution timer functionality +----------------------------------- + +During system boot it is not possible to use the high resolution timer +functionality, while making it possible would be difficult and would serve no +useful function. The initialization of the clock event device framework, the +clock source framework (GTOD) and hrtimers itself has to be done and +appropriate clock sources and clock event devices have to be registered before +the high resolution functionality can work. Up to the point where hrtimers are +initialized, the system works in the usual low resolution periodic mode. The +clock source and the clock event device layers provide notification functions +which inform hrtimers about availability of new hardware. hrtimers validates +the usability of the registered clock sources and clock event devices before +switching to high resolution mode. This ensures also that a kernel which is +configured for high resolution timers can run on a system which lacks the +necessary hardware support. + +The high resolution timer code does not support SMP machines which have only +global clock event devices. The support of such hardware would involve IPI +calls when an interrupt happens. The overhead would be much larger than the +benefit. This is the reason why we currently disable high resolution and +dynamic ticks on i386 SMP systems which stop the local APIC in C3 power +state. A workaround is available as an idea, but the problem has not been +tackled yet. + +The time ordered insertion of timers provides all the infrastructure to decide +whether the event device has to be reprogrammed when a timer is added. The +decision is made per timer base and synchronized across per-cpu timer bases in +a support function. The design allows the system to utilize separate per-CPU +clock event devices for the per-CPU timer bases, but currently only one +reprogrammable clock event device per-CPU is utilized. + +When the timer interrupt happens, the next event interrupt handler is called +from the clock event distribution code and moves expired timers from the +red-black tree to a separate double linked list and invokes the softirq +handler. An additional mode field in the hrtimer structure allows the system to +execute callback functions directly from the next event interrupt handler. This +is restricted to code which can safely be executed in the hard interrupt +context. This applies, for example, to the common case of a wakeup function as +used by nanosleep. The advantage of executing the handler in the interrupt +context is the avoidance of up to two context switches - from the interrupted +context to the softirq and to the task which is woken up by the expired +timer. + +Once a system has switched to high resolution mode, the periodic tick is +switched off. This disables the per system global periodic clock event device - +e.g. the PIT on i386 SMP systems. + +The periodic tick functionality is provided by an per-cpu hrtimer. The callback +function is executed in the next event interrupt context and updates jiffies +and calls update_process_times and profiling. The implementation of the hrtimer +based periodic tick is designed to be extended with dynamic tick functionality. +This allows to use a single clock event device to schedule high resolution +timer and periodic events (jiffies tick, profiling, process accounting) on UP +systems. This has been proved to work with the PIT on i386 and the Incrementer +on PPC. + +The softirq for running the hrtimer queues and executing the callbacks has been +separated from the tick bound timer softirq to allow accurate delivery of high +resolution timer signals which are used by itimer and POSIX interval +timers. The execution of this softirq can still be delayed by other softirqs, +but the overall latencies have been significantly improved by this separation. + +Figure #5 (OLS slides p.22) illustrates the transformation. + + +dynamic ticks +------------- + +Dynamic ticks are the logical consequence of the hrtimer based periodic tick +replacement (sched_tick). The functionality of the sched_tick hrtimer is +extended by three functions: + +- hrtimer_stop_sched_tick +- hrtimer_restart_sched_tick +- hrtimer_update_jiffies + +hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code +evaluates the next scheduled timer event (from both hrtimers and the timer +wheel) and in case that the next event is further away than the next tick it +reprograms the sched_tick to this future event, to allow longer idle sleeps +without worthless interruption by the periodic tick. The function is also +called when an interrupt happens during the idle period, which does not cause a +reschedule. The call is necessary as the interrupt handler might have armed a +new timer whose expiry time is before the time which was identified as the +nearest event in the previous call to hrtimer_stop_sched_tick. + +hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before +it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick, +which is kept active until the next call to hrtimer_stop_sched_tick(). + +hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens +in the idle period to make sure that jiffies are up to date and the interrupt +handler has not to deal with an eventually stale jiffy value. + +The dynamic tick feature provides statistical values which are exported to +userspace via /proc/stats and can be made available for enhanced power +management control. + +The implementation leaves room for further development like full tickless +systems, where the time slice is controlled by the scheduler, variable +frequency profiling, and a complete removal of jiffies in the future. + + +Aside the current initial submission of i386 support, the patchset has been +extended to x86_64 and ARM already. Initial (work in progress) support is also +available for MIPS and PowerPC. + + Thomas, Ingo + + + Index: linux/Documentation/hrtimer/hrtimers.txt =================================================================== --- /dev/null +++ linux/Documentation/hrtimer/hrtimers.txt @@ -0,0 +1,178 @@ + +hrtimers - subsystem for high-resolution kernel timers +---------------------------------------------------- + +This patch introduces a new subsystem for high-resolution kernel timers. + +One might ask the question: we already have a timer subsystem +(kernel/timers.c), why do we need two timer subsystems? After a lot of +back and forth trying to integrate high-resolution and high-precision +features into the existing timer framework, and after testing various +such high-resolution timer implementations in practice, we came to the +conclusion that the timer wheel code is fundamentally not suitable for +such an approach. We initially didn't believe this ('there must be a way +to solve this'), and spent a considerable effort trying to integrate +things into the timer wheel, but we failed. In hindsight, there are +several reasons why such integration is hard/impossible: + +- the forced handling of low-resolution and high-resolution timers in + the same way leads to a lot of compromises, macro magic and #ifdef + mess. The timers.c code is very "tightly coded" around jiffies and + 32-bitness assumptions, and has been honed and micro-optimized for a + relatively narrow use case (jiffies in a relatively narrow HZ range) + for many years - and thus even small extensions to it easily break + the wheel concept, leading to even worse compromises. The timer wheel + code is very good and tight code, there's zero problems with it in its + current usage - but it is simply not suitable to be extended for + high-res timers. + +- the unpredictable [O(N)] overhead of cascading leads to delays which + necessitate a more complex handling of high resolution timers, which + in turn decreases robustness. Such a design still led to rather large + timing inaccuracies. Cascading is a fundamental property of the timer + wheel concept, it cannot be 'designed out' without unevitably + degrading other portions of the timers.c code in an unacceptable way. + +- the implementation of the current posix-timer subsystem on top of + the timer wheel has already introduced a quite complex handling of + the required readjusting of absolute CLOCK_REALTIME timers at + settimeofday or NTP time - further underlying our experience by + example: that the timer wheel data structure is too rigid for high-res + timers. + +- the timer wheel code is most optimal for use cases which can be + identified as "timeouts". Such timeouts are usually set up to cover + error conditions in various I/O paths, such as networking and block + I/O. The vast majority of those timers never expire and are rarely + recascaded because the expected correct event arrives in time so they + can be removed from the timer wheel before any further processing of + them becomes necessary. Thus the users of these timeouts can accept + the granularity and precision tradeoffs of the timer wheel, and + largely expect the timer subsystem to have near-zero overhead. + Accurate timing for them is not a core purpose - in fact most of the + timeout values used are ad-hoc. For them it is at most a necessary + evil to guarantee the processing of actual timeout completions + (because most of the timeouts are deleted before completion), which + should thus be as cheap and unintrusive as possible. + +The primary users of precision timers are user-space applications that +utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel +users like drivers and subsystems which require precise timed events +(e.g. multimedia) can benefit from the availability of a separate +high-resolution timer subsystem as well. + +While this subsystem does not offer high-resolution clock sources just +yet, the hrtimer subsystem can be easily extended with high-resolution +clock capabilities, and patches for that exist and are maturing quickly. +The increasing demand for realtime and multimedia applications along +with other potential users for precise timers gives another reason to +separate the "timeout" and "precise timer" subsystems. + +Another potential benefit is that such a separation allows even more +special-purpose optimization of the existing timer wheel for the low +resolution and low precision use cases - once the precision-sensitive +APIs are separated from the timer wheel and are migrated over to +hrtimers. E.g. we could decrease the frequency of the timeout subsystem +from 250 Hz to 100 HZ (or even smaller). + +hrtimer subsystem implementation details +---------------------------------------- + +the basic design considerations were: + +- simplicity + +- data structure not bound to jiffies or any other granularity. All the + kernel logic works at 64-bit nanoseconds resolution - no compromises. + +- simplification of existing, timing related kernel code + +another basic requirement was the immediate enqueueing and ordering of +timers at activation time. After looking at several possible solutions +such as radix trees and hashes, we chose the red black tree as the basic +data structure. Rbtrees are available as a library in the kernel and are +used in various performance-critical areas of e.g. memory management and +file systems. The rbtree is solely used for time sorted ordering, while +a separate list is used to give the expiry code fast access to the +queued timers, without having to walk the rbtree. + +(This separate list is also useful for later when we'll introduce +high-resolution clocks, where we need separate pending and expired +queues while keeping the time-order intact.) + +Time-ordered enqueueing is not purely for the purposes of +high-resolution clocks though, it also simplifies the handling of +absolute timers based on a low-resolution CLOCK_REALTIME. The existing +implementation needed to keep an extra list of all armed absolute +CLOCK_REALTIME timers along with complex locking. In case of +settimeofday and NTP, all the timers (!) had to be dequeued, the +time-changing code had to fix them up one by one, and all of them had to +be enqueued again. The time-ordered enqueueing and the storage of the +expiry time in absolute time units removes all this complex and poorly +scaling code from the posix-timer implementation - the clock can simply +be set without having to touch the rbtree. This also makes the handling +of posix-timers simpler in general. + +The locking and per-CPU behavior of hrtimers was mostly taken from the +existing timer wheel code, as it is mature and well suited. Sharing code +was not really a win, due to the different data structures. Also, the +hrtimer functions now have clearer behavior and clearer names - such as +hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly +equivalent to del_timer() and del_timer_sync()] - so there's no direct +1:1 mapping between them on the algorithmical level, and thus no real +potential for code sharing either. + +Basic data types: every time value, absolute or relative, is in a +special nanosecond-resolution type: ktime_t. The kernel-internal +representation of ktime_t values and operations is implemented via +macros and inline functions, and can be switched between a "hybrid +union" type and a plain "scalar" 64bit nanoseconds representation (at +compile time). The hybrid union type optimizes time conversions on 32bit +CPUs. This build-time-selectable ktime_t storage format was implemented +to avoid the performance impact of 64-bit multiplications and divisions +on 32bit CPUs. Such operations are frequently necessary to convert +between the storage formats provided by kernel and userspace interfaces +and the internal time format. (See include/linux/ktime.h for further +details.) + +hrtimers - rounding of timer values +----------------------------------- + +the hrtimer code will round timer events to lower-resolution clocks +because it has to. Otherwise it will do no artificial rounding at all. + +one question is, what resolution value should be returned to the user by +the clock_getres() interface. This will return whatever real resolution +a given clock has - be it low-res, high-res, or artificially-low-res. + +hrtimers - testing and verification +---------------------------------- + +We used the high-resolution clock subsystem ontop of hrtimers to verify +the hrtimer implementation details in praxis, and we also ran the posix +timer tests in order to ensure specification compliance. We also ran +tests on low-resolution clocks. + +The hrtimer patch converts the following kernel functionality to use +hrtimers: + + - nanosleep + - itimers + - posix-timers + +The conversion of nanosleep and posix-timers enabled the unification of +nanosleep and clock_nanosleep. + +The code was successfully compiled for the following platforms: + + i386, x86_64, ARM, PPC, PPC64, IA64 + +The code was run-tested on the following platforms: + + i386(UP/SMP), x86_64(UP/SMP), ARM, PPC + +hrtimers were also integrated into the -rt tree, along with a +hrtimers-based high-resolution clock implementation, so the hrtimers +code got a healthy amount of testing and use in practice. + + Thomas Gleixner, Ingo Molnar Index: linux/Documentation/hrtimer/timer_stats.txt =================================================================== --- /dev/null +++ linux/Documentation/hrtimer/timer_stats.txt @@ -0,0 +1,68 @@ +timer_stats - timer usage statistics +------------------------------------ + +timer_stats is a debugging facility to make the timer (ab)usage in a Linux +system visible to kernel and userspace developers. It is not intended for +production usage as it adds significant overhead to the (hr)timer code and the +(hr)timer data structures. + +timer_stats should be used by kernel and userspace developers to verify that +their code does not make unduly use of timers. This helps to avoid unnecessary +wakeups, which should be avoided to optimize power consumption. + +It can be enabled by CONFIG_TIMER_STATS in the "Kernel hacking" configuration +section. + +timer_stats collects information about the timer events which are fired in a +Linux system over a sample period: + +- the pid of the task(process) which initialized the timer +- the name of the process which initialized the timer +- the function where the timer was intialized +- the callback function which is associated to the timer +- the number of events (callbacks) + +timer_stats adds an entry to /proc: /proc/timer_stats + +This entry is used to control the statistics functionality and to read out the +sampled information. + +The timer_stats functionality is inactive on bootup. + +To activate a sample period issue: +# echo 1 >/proc/timer_stats + +To stop a sample period issue: +# echo 0 >/proc/timer_stats + +The statistics can be retrieved by: +# cat /proc/timer_stats + +The readout of /proc/timer_stats automatically disables sampling. The sampled +information is kept until a new sample period is started. This allows multiple +readouts. + +Sample output of /proc/timer_stats: + +Timerstats sample period: 3.888770 s + 12, 0 swapper hrtimer_stop_sched_tick (hrtimer_sched_tick) + 15, 1 swapper hcd_submit_urb (rh_timer_func) + 4, 959 kedac schedule_timeout (process_timeout) + 1, 0 swapper page_writeback_init (wb_timer_fn) + 28, 0 swapper hrtimer_stop_sched_tick (hrtimer_sched_tick) + 22, 2948 IRQ 4 tty_flip_buffer_push (delayed_work_timer_fn) + 3, 3100 bash schedule_timeout (process_timeout) + 1, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) + 1, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) + 1, 1 swapper neigh_table_init_no_netlink (neigh_periodic_timer) + 1, 2292 ip __netdev_watchdog_up (dev_watchdog) + 1, 23 events/1 do_cache_clean (delayed_work_timer_fn) +90 total events, 30.0 events/sec + +The first column is the number of events, the second column the pid, the third +column is the name of the process. The forth column shows the function which +initialized the timer and in parantheses the callback function which was +executed on expiry. + + Thomas, Ingo + Index: linux/Documentation/hrtimers.txt =================================================================== --- linux.orig/Documentation/hrtimers.txt +++ /dev/null @@ -1,178 +0,0 @@ - -hrtimers - subsystem for high-resolution kernel timers ----------------------------------------------------- - -This patch introduces a new subsystem for high-resolution kernel timers. - -One might ask the question: we already have a timer subsystem -(kernel/timers.c), why do we need two timer subsystems? After a lot of -back and forth trying to integrate high-resolution and high-precision -features into the existing timer framework, and after testing various -such high-resolution timer implementations in practice, we came to the -conclusion that the timer wheel code is fundamentally not suitable for -such an approach. We initially didn't believe this ('there must be a way -to solve this'), and spent a considerable effort trying to integrate -things into the timer wheel, but we failed. In hindsight, there are -several reasons why such integration is hard/impossible: - -- the forced handling of low-resolution and high-resolution timers in - the same way leads to a lot of compromises, macro magic and #ifdef - mess. The timers.c code is very "tightly coded" around jiffies and - 32-bitness assumptions, and has been honed and micro-optimized for a - relatively narrow use case (jiffies in a relatively narrow HZ range) - for many years - and thus even small extensions to it easily break - the wheel concept, leading to even worse compromises. The timer wheel - code is very good and tight code, there's zero problems with it in its - current usage - but it is simply not suitable to be extended for - high-res timers. - -- the unpredictable [O(N)] overhead of cascading leads to delays which - necessitate a more complex handling of high resolution timers, which - in turn decreases robustness. Such a design still led to rather large - timing inaccuracies. Cascading is a fundamental property of the timer - wheel concept, it cannot be 'designed out' without unevitably - degrading other portions of the timers.c code in an unacceptable way. - -- the implementation of the current posix-timer subsystem on top of - the timer wheel has already introduced a quite complex handling of - the required readjusting of absolute CLOCK_REALTIME timers at - settimeofday or NTP time - further underlying our experience by - example: that the timer wheel data structure is too rigid for high-res - timers. - -- the timer wheel code is most optimal for use cases which can be - identified as "timeouts". Such timeouts are usually set up to cover - error conditions in various I/O paths, such as networking and block - I/O. The vast majority of those timers never expire and are rarely - recascaded because the expected correct event arrives in time so they - can be removed from the timer wheel before any further processing of - them becomes necessary. Thus the users of these timeouts can accept - the granularity and precision tradeoffs of the timer wheel, and - largely expect the timer subsystem to have near-zero overhead. - Accurate timing for them is not a core purpose - in fact most of the - timeout values used are ad-hoc. For them it is at most a necessary - evil to guarantee the processing of actual timeout completions - (because most of the timeouts are deleted before completion), which - should thus be as cheap and unintrusive as possible. - -The primary users of precision timers are user-space applications that -utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel -users like drivers and subsystems which require precise timed events -(e.g. multimedia) can benefit from the availability of a separate -high-resolution timer subsystem as well. - -While this subsystem does not offer high-resolution clock sources just -yet, the hrtimer subsystem can be easily extended with high-resolution -clock capabilities, and patches for that exist and are maturing quickly. -The increasing demand for realtime and multimedia applications along -with other potential users for precise timers gives another reason to -separate the "timeout" and "precise timer" subsystems. - -Another potential benefit is that such a separation allows even more -special-purpose optimization of the existing timer wheel for the low -resolution and low precision use cases - once the precision-sensitive -APIs are separated from the timer wheel and are migrated over to -hrtimers. E.g. we could decrease the frequency of the timeout subsystem -from 250 Hz to 100 HZ (or even smaller). - -hrtimer subsystem implementation details ----------------------------------------- - -the basic design considerations were: - -- simplicity - -- data structure not bound to jiffies or any other granularity. All the - kernel logic works at 64-bit nanoseconds resolution - no compromises. - -- simplification of existing, timing related kernel code - -another basic requirement was the immediate enqueueing and ordering of -timers at activation time. After looking at several possible solutions -such as radix trees and hashes, we chose the red black tree as the basic -data structure. Rbtrees are available as a library in the kernel and are -used in various performance-critical areas of e.g. memory management and -file systems. The rbtree is solely used for time sorted ordering, while -a separate list is used to give the expiry code fast access to the -queued timers, without having to walk the rbtree. - -(This separate list is also useful for later when we'll introduce -high-resolution clocks, where we need separate pending and expired -queues while keeping the time-order intact.) - -Time-ordered enqueueing is not purely for the purposes of -high-resolution clocks though, it also simplifies the handling of -absolute timers based on a low-resolution CLOCK_REALTIME. The existing -implementation needed to keep an extra list of all armed absolute -CLOCK_REALTIME timers along with complex locking. In case of -settimeofday and NTP, all the timers (!) had to be dequeued, the -time-changing code had to fix them up one by one, and all of them had to -be enqueued again. The time-ordered enqueueing and the storage of the -expiry time in absolute time units removes all this complex and poorly -scaling code from the posix-timer implementation - the clock can simply -be set without having to touch the rbtree. This also makes the handling -of posix-timers simpler in general. - -The locking and per-CPU behavior of hrtimers was mostly taken from the -existing timer wheel code, as it is mature and well suited. Sharing code -was not really a win, due to the different data structures. Also, the -hrtimer functions now have clearer behavior and clearer names - such as -hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly -equivalent to del_timer() and del_timer_sync()] - so there's no direct -1:1 mapping between them on the algorithmical level, and thus no real -potential for code sharing either. - -Basic data types: every time value, absolute or relative, is in a -special nanosecond-resolution type: ktime_t. The kernel-internal -representation of ktime_t values and operations is implemented via -macros and inline functions, and can be switched between a "hybrid -union" type and a plain "scalar" 64bit nanoseconds representation (at -compile time). The hybrid union type optimizes time conversions on 32bit -CPUs. This build-time-selectable ktime_t storage format was implemented -to avoid the performance impact of 64-bit multiplications and divisions -on 32bit CPUs. Such operations are frequently necessary to convert -between the storage formats provided by kernel and userspace interfaces -and the internal time format. (See include/linux/ktime.h for further -details.) - -hrtimers - rounding of timer values ------------------------------------ - -the hrtimer code will round timer events to lower-resolution clocks -because it has to. Otherwise it will do no artificial rounding at all. - -one question is, what resolution value should be returned to the user by -the clock_getres() interface. This will return whatever real resolution -a given clock has - be it low-res, high-res, or artificially-low-res. - -hrtimers - testing and verification ----------------------------------- - -We used the high-resolution clock subsystem ontop of hrtimers to verify -the hrtimer implementation details in praxis, and we also ran the posix -timer tests in order to ensure specification compliance. We also ran -tests on low-resolution clocks. - -The hrtimer patch converts the following kernel functionality to use -hrtimers: - - - nanosleep - - itimers - - posix-timers - -The conversion of nanosleep and posix-timers enabled the unification of -nanosleep and clock_nanosleep. - -The code was successfully compiled for the following platforms: - - i386, x86_64, ARM, PPC, PPC64, IA64 - -The code was run-tested on the following platforms: - - i386(UP/SMP), x86_64(UP/SMP), ARM, PPC - -hrtimers were also integrated into the -rt tree, along with a -hrtimers-based high-resolution clock implementation, so the hrtimers -code got a healthy amount of testing and use in practice. - - Thomas Gleixner, Ingo Molnar Index: linux/Documentation/kernel-parameters.txt =================================================================== --- linux.orig/Documentation/kernel-parameters.txt +++ linux/Documentation/kernel-parameters.txt @@ -597,6 +597,10 @@ and is between 256 and 4096 characters. highmem otherwise. This also works to reduce highmem size on bigger boxes. + highres= [KNL] Enable/disable high resolution timer mode. + Valid parameters: "on", "off" + Default: "on" + hisax= [HW,ISDN] See Documentation/isdn/README.HiSax. @@ -653,6 +657,10 @@ and is between 256 and 4096 characters. idle= [HW] Format: idle=poll or idle=halt + ignore_loglevel [KNL] + Ignore loglevel setting - this will print /all/ + kernel messages to the console. Useful for debugging. + ihash_entries= [KNL] Set number of hash buckets for inode cache. @@ -748,6 +756,11 @@ and is between 256 and 4096 characters. lapic [IA-32,APIC] Enable the local APIC even if BIOS disabled it. + lapictimer [IA-32,APIC] Enable the local APIC timer on UP + systems for high resulution timers and dynticks. + This only has an effect when the local APIC is + available. It does not imply the "lapic" option. + lasi= [HW,SCSI] PARISC LASI driver for the 53c700 chip Format: addr:,irq: Index: linux/Makefile =================================================================== --- linux.orig/Makefile +++ linux/Makefile @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 19 -EXTRAVERSION = +EXTRAVERSION = -rt7 NAME=Avast! A bilge rat! # *DOCUMENTATION* @@ -491,10 +491,14 @@ endif include $(srctree)/arch/$(ARCH)/Makefile -ifdef CONFIG_FRAME_POINTER -CFLAGS += -fno-omit-frame-pointer $(call cc-option,-fno-optimize-sibling-calls,) +ifdef CONFIG_MCOUNT +CFLAGS += -pg -fno-omit-frame-pointer $(call cc-option,-fno-optimize-sibling-calls,) else -CFLAGS += -fomit-frame-pointer + ifdef CONFIG_FRAME_POINTER + CFLAGS += -fno-omit-frame-pointer $(call cc-option,-fno-optimize-sibling-calls,) + else + CFLAGS += -fomit-frame-pointer + endif endif ifdef CONFIG_UNWIND_INFO Index: linux/arch/arm/Kconfig =================================================================== --- linux.orig/arch/arm/Kconfig +++ linux/arch/arm/Kconfig @@ -487,18 +487,7 @@ config LOCAL_TIMERS accounting to be spread across the timer interval, preventing a "thundering herd" at every timer tick. -config PREEMPT - bool "Preemptible Kernel (EXPERIMENTAL)" - depends on EXPERIMENTAL - help - This option reduces the latency of the kernel when reacting to - real-time or interactive events by allowing a low priority process to - be preempted even if it is in kernel mode executing a system call. - This allows applications to run more reliably even when the system is - under load. - - Say Y here if you are building a kernel for a desktop, embedded - or real-time system. Say N if you are unsure. +source kernel/Kconfig.preempt config NO_IDLE_HZ bool "Dynamic tick timer" Index: linux/arch/arm/boot/compressed/head.S =================================================================== --- linux.orig/arch/arm/boot/compressed/head.S +++ linux/arch/arm/boot/compressed/head.S @@ -833,6 +833,19 @@ memdump: mov r12, r0 mov pc, r10 #endif +#ifdef CONFIG_MCOUNT +/* CONFIG_MCOUNT causes boot header to be built with -pg requiring this + * trampoline + */ + .text + .align 0 + .type mcount %function + .global mcount +mcount: + mov pc, lr @ just return +#endif + + reloc_end: .align Index: linux/arch/arm/common/time-acorn.c =================================================================== --- linux.orig/arch/arm/common/time-acorn.c +++ linux/arch/arm/common/time-acorn.c @@ -77,7 +77,7 @@ ioc_timer_interrupt(int irq, void *dev_i static struct irqaction ioc_timer_irq = { .name = "timer", - .flags = IRQF_DISABLED, + .flags = IRQF_DISABLED | IRQF_NODELAY, .handler = ioc_timer_interrupt }; Index: linux/arch/arm/kernel/dma.c =================================================================== --- linux.orig/arch/arm/kernel/dma.c +++ linux/arch/arm/kernel/dma.c @@ -20,7 +20,7 @@ #include -DEFINE_SPINLOCK(dma_spin_lock); +DEFINE_RAW_SPINLOCK(dma_spin_lock); EXPORT_SYMBOL(dma_spin_lock); static dma_t dma_chan[MAX_DMA_CHANNELS]; Index: linux/arch/arm/kernel/entry-armv.S =================================================================== --- linux.orig/arch/arm/kernel/entry-armv.S +++ linux/arch/arm/kernel/entry-armv.S @@ -204,7 +204,7 @@ __irq_svc: irq_handler #ifdef CONFIG_PREEMPT ldr r0, [tsk, #TI_FLAGS] @ get flags - tst r0, #_TIF_NEED_RESCHED + tst r0, #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED blne svc_preempt preempt_return: ldr r0, [tsk, #TI_PREEMPT] @ read preempt value @@ -235,7 +235,7 @@ svc_preempt: str r7, [tsk, #TI_PREEMPT] @ expects preempt_count == 0 1: bl preempt_schedule_irq @ irq en/disable is done inside ldr r0, [tsk, #TI_FLAGS] @ get new tasks TI_FLAGS - tst r0, #_TIF_NEED_RESCHED + tst r0, #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED beq preempt_return @ go again b 1b #endif Index: linux/arch/arm/kernel/entry-common.S =================================================================== --- linux.orig/arch/arm/kernel/entry-common.S +++ linux/arch/arm/kernel/entry-common.S @@ -3,6 +3,8 @@ * * Copyright (C) 2000 Russell King * + * FUNCTION_TRACE/mcount support (C) 2005 Timesys john.cooper@timesys.com + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. @@ -40,7 +42,7 @@ ret_fast_syscall: fast_work_pending: str r0, [sp, #S_R0+S_OFF]! @ returned r0 work_pending: - tst r1, #_TIF_NEED_RESCHED + tst r1, #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED bne work_resched tst r1, #_TIF_NOTIFY_RESUME | _TIF_SIGPENDING beq no_work_pending @@ -50,7 +52,8 @@ work_pending: b ret_slow_syscall @ Check work again work_resched: - bl schedule + bl __schedule + /* * "slow" syscall return path. "why" tells us if this was a real syscall. */ @@ -387,6 +390,112 @@ ENTRY(sys_oabi_call_table) #include "calls.S" #undef ABI #undef OBSOLETE +#endif + +#ifdef CONFIG_FRAME_POINTER + +#ifdef CONFIG_MCOUNT +/* + * At the point where we are in mcount() we maintain the + * frame of the prologue code and keep the call to mcount() + * out of the stack frame list: + + saved pc <---\ caller of instrumented routine + saved lr | + ip/prev_sp | + fp -----^ | + : | + | + -> saved pc | instrumented routine + | saved lr | + | ip/prev_sp | + | fp ---------/ + | : + | + | mcount + | saved pc + | saved lr + | ip/prev sp + -- fp + r3 + r2 + r1 + sp-> r0 + : + */ + + .text + .align 0 + .type mcount %function + .global mcount + +/* gcc -pg generated FUNCTION_PROLOGUE references mcount() + * and has already created the stack frame invocation for + * the routine we have been called to instrument. We create + * a complete frame nevertheless, as we want to use the same + * call to mcount() from c code. + */ +mcount: + + ldr ip, =mcount_enabled @ leave early, if disabled + ldr ip, [ip] + cmp ip, #0 + moveq pc, lr + + mov ip, sp + stmdb sp!, {r0 - r3, fp, ip, lr, pc} @ create stack frame + + ldr r1, [fp, #-4] @ get lr (the return address + @ of the caller of the + @ instrumented function) + mov r0, lr @ get lr - (the return address + @ of the instrumented function) + + sub fp, ip, #4 @ point fp at this frame + + bl __trace +1: + ldmdb fp, {r0 - r3, fp, sp, pc} @ pop entry frame and return + +#endif + +/* ARM replacement for unsupported gcc __builtin_return_address(n) + * where 0 < n. n == 0 is supported here as well. + * + * Walk up the stack frame until the desired frame is found or a NULL + * fp is encountered, return NULL in the latter case. + * + * Note: it is possible under code optimization for the stack invocation + * of an ancestor function (level N) to be removed before calling a + * descendant function (level N+1). No easy means is available to deduce + * this scenario with the result being [for example] caller_addr(0) when + * called from level N+1 returning level N-1 rather than the expected + * level N. This optimization issue appears isolated to the case of + * a call to a level N+1 routine made at the tail end of a level N + * routine -- the level N frame is deleted and a simple branch is made + * to the level N+1 routine. + */ + + .text + .align 0 + .type arm_return_addr %function + .global arm_return_addr + +arm_return_addr: + mov ip, r0 + mov r0, fp +3: + cmp r0, #0 + beq 1f @ frame list hit end, bail + cmp ip, #0 + beq 2f @ reached desired frame + ldr r0, [r0, #-12] @ else continue, get next fp + sub ip, ip, #1 + b 3b +2: + ldr r0, [r0, #-4] @ get target return address +1: + mov pc, lr #endif Index: linux/arch/arm/kernel/fiq.c =================================================================== --- linux.orig/arch/arm/kernel/fiq.c +++ linux/arch/arm/kernel/fiq.c @@ -89,7 +89,7 @@ void set_fiq_handler(void *start, unsign * disable irqs for the duration. Note - these functions are almost * entirely coded in assembly. */ -void __attribute__((naked)) set_fiq_regs(struct pt_regs *regs) +void notrace __attribute__((naked)) set_fiq_regs(struct pt_regs *regs) { register unsigned long tmp; asm volatile ( @@ -107,7 +107,7 @@ void __attribute__((naked)) set_fiq_regs : "r" (®s->ARM_r8), "I" (PSR_I_BIT | PSR_F_BIT | FIQ_MODE)); } -void __attribute__((naked)) get_fiq_regs(struct pt_regs *regs) +void notrace __attribute__((naked)) get_fiq_regs(struct pt_regs *regs) { register unsigned long tmp; asm volatile ( Index: linux/arch/arm/kernel/irq.c =================================================================== --- linux.orig/arch/arm/kernel/irq.c +++ linux/arch/arm/kernel/irq.c @@ -101,7 +101,7 @@ unlock: /* Handle bad interrupts */ static struct irq_desc bad_irq_desc = { .handle_irq = handle_bad_irq, - .lock = SPIN_LOCK_UNLOCKED + .lock = RAW_SPIN_LOCK_UNLOCKED(bad_irq_desc.lock) }; /* @@ -109,11 +109,13 @@ static struct irq_desc bad_irq_desc = { * come via this function. Instead, they should provide their * own 'handler' */ -asmlinkage void asm_do_IRQ(unsigned int irq, struct pt_regs *regs) +asmlinkage notrace void asm_do_IRQ(unsigned int irq, struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); struct irqdesc *desc = irq_desc + irq; + trace_special(instruction_pointer(regs), irq, 0); + /* * Some hardware gives randomly wrong interrupts. Rather * than crashing, do something sensible. @@ -159,8 +161,7 @@ void __init init_IRQ(void) int irq; for (irq = 0; irq < NR_IRQS; irq++) - irq_desc[irq].status |= IRQ_NOREQUEST | IRQ_DELAYED_DISABLE | - IRQ_NOPROBE; + irq_desc[irq].status |= IRQ_NOREQUEST | IRQ_NOPROBE; #ifdef CONFIG_SMP bad_irq_desc.affinity = CPU_MASK_ALL; Index: linux/arch/arm/kernel/process.c =================================================================== --- linux.orig/arch/arm/kernel/process.c +++ linux/arch/arm/kernel/process.c @@ -123,7 +123,7 @@ static void default_idle(void) cpu_relax(); else { local_irq_disable(); - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { timer_dyn_reprogram(); arch_idle(); } @@ -154,12 +154,16 @@ void cpu_idle(void) if (!idle) idle = default_idle; leds_event(led_idle_start); - while (!need_resched()) + hrtimer_stop_sched_tick(); + while (!need_resched() && !need_resched_delayed()) idle(); leds_event(led_idle_end); - preempt_enable_no_resched(); - schedule(); + hrtimer_restart_sched_tick(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } Index: linux/arch/arm/kernel/semaphore.c =================================================================== --- linux.orig/arch/arm/kernel/semaphore.c +++ linux/arch/arm/kernel/semaphore.c @@ -49,14 +49,16 @@ * we cannot lose wakeup events. */ -void __up(struct semaphore *sem) +fastcall void __attribute_used__ __compat_up(struct compat_semaphore *sem) { wake_up(&sem->wait); } +EXPORT_SYMBOL(__compat_up); + static DEFINE_SPINLOCK(semaphore_lock); -void __sched __down(struct semaphore * sem) +fastcall void __attribute_used__ __sched __compat_down(struct compat_semaphore * sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -89,7 +91,9 @@ void __sched __down(struct semaphore * s wake_up(&sem->wait); } -int __sched __down_interruptible(struct semaphore * sem) +EXPORT_SYMBOL(__compat_down); + +fastcall int __attribute_used__ __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -140,6 +144,8 @@ int __sched __down_interruptible(struct return retval; } +EXPORT_SYMBOL(__compat_down_interruptible); + /* * Trylock failed - make sure we correct for * having decremented the count. @@ -148,7 +154,7 @@ int __sched __down_interruptible(struct * single "cmpxchg" without failure cases, * but then it wouldn't work on a 386. */ -int __down_trylock(struct semaphore * sem) +fastcall int __attribute_used__ __sched __compat_down_trylock(struct compat_semaphore * sem) { int sleepers; unsigned long flags; @@ -168,6 +174,15 @@ int __down_trylock(struct semaphore * se return 1; } +EXPORT_SYMBOL(__compat_down_trylock); + +fastcall int __sched compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} + +EXPORT_SYMBOL(compat_sem_is_locked); + /* * The semaphore operations have a special calling sequence that * allow us to do a simpler in-line version of them. These routines @@ -185,7 +200,7 @@ asm(" .section .sched.text,\"ax\",%progb __down_failed: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __down \n\ + bl __compat_down \n\ ldmfd sp!, {r0 - r4, pc} \n\ \n\ .align 5 \n\ @@ -193,7 +208,7 @@ __down_failed: \n\ __down_interruptible_failed: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __down_interruptible \n\ + bl __compat_down_interruptible \n\ mov ip, r0 \n\ ldmfd sp!, {r0 - r4, pc} \n\ \n\ @@ -202,7 +217,7 @@ __down_interruptible_failed: \n\ __down_trylock_failed: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __down_trylock \n\ + bl __compat_down_trylock \n\ mov ip, r0 \n\ ldmfd sp!, {r0 - r4, pc} \n\ \n\ @@ -211,7 +226,7 @@ __down_trylock_failed: \n\ __up_wakeup: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __up \n\ + bl __compat_up \n\ ldmfd sp!, {r0 - r4, pc} \n\ "); Index: linux/arch/arm/kernel/signal.c =================================================================== --- linux.orig/arch/arm/kernel/signal.c +++ linux/arch/arm/kernel/signal.c @@ -630,6 +630,14 @@ static int do_signal(sigset_t *oldset, s siginfo_t info; int signr; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif + /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux/arch/arm/kernel/smp.c =================================================================== --- linux.orig/arch/arm/kernel/smp.c +++ linux/arch/arm/kernel/smp.c @@ -521,7 +521,7 @@ static void ipi_call_function(unsigned i cpu_clear(cpu, data->unfinished); } -static DEFINE_SPINLOCK(stop_lock); +static DEFINE_RAW_SPINLOCK(stop_lock); /* * ipi_cpu_stop - handle IPI from smp_send_stop() Index: linux/arch/arm/kernel/traps.c =================================================================== --- linux.orig/arch/arm/kernel/traps.c +++ linux/arch/arm/kernel/traps.c @@ -176,6 +176,7 @@ void dump_stack(void) { #ifdef CONFIG_DEBUG_ERRORS __backtrace(); + print_traces(current); #endif } @@ -216,7 +217,7 @@ static void __die(const char *str, int e } } -DEFINE_SPINLOCK(die_lock); +DEFINE_RAW_SPINLOCK(die_lock); /* * This function is protected against re-entrancy. @@ -252,7 +253,7 @@ void notify_die(const char *str, struct } static LIST_HEAD(undef_hook); -static DEFINE_SPINLOCK(undef_lock); +static DEFINE_RAW_SPINLOCK(undef_lock); void register_undef_hook(struct undef_hook *hook) { Index: linux/arch/arm/mach-footbridge/netwinder-hw.c =================================================================== --- linux.orig/arch/arm/mach-footbridge/netwinder-hw.c +++ linux/arch/arm/mach-footbridge/netwinder-hw.c @@ -67,7 +67,7 @@ static inline void wb977_ww(int reg, int /* * This is a lock for accessing ports GP1_IO_BASE and GP2_IO_BASE */ -DEFINE_SPINLOCK(gpio_lock); +DEFINE_RAW_SPINLOCK(gpio_lock); static unsigned int current_gpio_op; static unsigned int current_gpio_io; Index: linux/arch/arm/mach-footbridge/netwinder-leds.c =================================================================== --- linux.orig/arch/arm/mach-footbridge/netwinder-leds.c +++ linux/arch/arm/mach-footbridge/netwinder-leds.c @@ -32,7 +32,7 @@ static char led_state; static char hw_led_state; static DEFINE_SPINLOCK(leds_lock); -extern spinlock_t gpio_lock; +extern raw_spinlock_t gpio_lock; static void netwinder_leds_event(led_event_t evt) { Index: linux/arch/arm/mach-integrator/core.c =================================================================== --- linux.orig/arch/arm/mach-integrator/core.c +++ linux/arch/arm/mach-integrator/core.c @@ -164,7 +164,7 @@ static struct amba_pl010_data integrator #define CM_CTRL IO_ADDRESS(INTEGRATOR_HDR_BASE) + INTEGRATOR_HDR_CTRL_OFFSET -static DEFINE_SPINLOCK(cm_lock); +static DEFINE_RAW_SPINLOCK(cm_lock); /** * cm_control - update the CM_CTRL register. Index: linux/arch/arm/mach-integrator/pci_v3.c =================================================================== --- linux.orig/arch/arm/mach-integrator/pci_v3.c +++ linux/arch/arm/mach-integrator/pci_v3.c @@ -162,7 +162,7 @@ * 7:2 register number * */ -static DEFINE_SPINLOCK(v3_lock); +static DEFINE_RAW_SPINLOCK(v3_lock); #define PCI_BUS_NONMEM_START 0x00000000 #define PCI_BUS_NONMEM_SIZE SZ_256M Index: linux/arch/arm/mach-integrator/platsmp.c =================================================================== --- linux.orig/arch/arm/mach-integrator/platsmp.c +++ linux/arch/arm/mach-integrator/platsmp.c @@ -31,7 +31,7 @@ extern void integrator_secondary_startup volatile int __cpuinitdata pen_release = -1; unsigned long __cpuinitdata phys_pen_release = 0; -static DEFINE_SPINLOCK(boot_lock); +static DEFINE_RAW_SPINLOCK(boot_lock); void __cpuinit platform_secondary_init(unsigned int cpu) { Index: linux/arch/arm/mach-ixp4xx/common-pci.c =================================================================== --- linux.orig/arch/arm/mach-ixp4xx/common-pci.c +++ linux/arch/arm/mach-ixp4xx/common-pci.c @@ -53,7 +53,7 @@ unsigned long ixp4xx_pci_reg_base = 0; * these transactions are atomic or we will end up * with corrupt data on the bus or in a driver. */ -static DEFINE_SPINLOCK(ixp4xx_pci_lock); +static DEFINE_RAW_SPINLOCK(ixp4xx_pci_lock); /* * Read from PCI config space Index: linux/arch/arm/mach-omap1/pm.c =================================================================== --- linux.orig/arch/arm/mach-omap1/pm.c +++ linux/arch/arm/mach-omap1/pm.c @@ -120,7 +120,7 @@ void omap_pm_idle(void) local_irq_disable(); local_fiq_disable(); - if (need_resched()) { + if (need_resched() || need_resched_delayed()) { local_fiq_enable(); local_irq_enable(); return; Index: linux/arch/arm/mach-omap2/pm.c =================================================================== --- linux.orig/arch/arm/mach-omap2/pm.c +++ linux/arch/arm/mach-omap2/pm.c @@ -53,7 +53,7 @@ void omap2_pm_idle(void) { local_irq_disable(); local_fiq_disable(); - if (need_resched()) { + if (need_resched() || need_resched_delayed()) { local_fiq_enable(); local_irq_enable(); return; Index: linux/arch/arm/mach-sa1100/badge4.c =================================================================== --- linux.orig/arch/arm/mach-sa1100/badge4.c +++ linux/arch/arm/mach-sa1100/badge4.c @@ -240,15 +240,22 @@ void badge4_set_5V(unsigned subsystem, i /* detect on->off and off->on transitions */ if ((!old_5V_bitmap) && (badge4_5V_bitmap)) { /* was off, now on */ - printk(KERN_INFO "%s: enabling 5V supply rail\n", __FUNCTION__); GPSR = BADGE4_GPIO_PCMEN5V; } else if ((old_5V_bitmap) && (!badge4_5V_bitmap)) { /* was on, now off */ - printk(KERN_INFO "%s: disabling 5V supply rail\n", __FUNCTION__); GPCR = BADGE4_GPIO_PCMEN5V; } local_irq_restore(flags); + + /* detect on->off and off->on transitions */ + if ((!old_5V_bitmap) && (badge4_5V_bitmap)) { + /* was off, now on */ + printk(KERN_INFO "%s: enabling 5V supply rail\n", __FUNCTION__); + } else if ((old_5V_bitmap) && (!badge4_5V_bitmap)) { + /* was on, now off */ + printk(KERN_INFO "%s: disabling 5V supply rail\n", __FUNCTION__); + } } EXPORT_SYMBOL(badge4_set_5V); Index: linux/arch/arm/mach-shark/leds.c =================================================================== --- linux.orig/arch/arm/mach-shark/leds.c +++ linux/arch/arm/mach-shark/leds.c @@ -32,7 +32,7 @@ static char led_state; static short hw_led_state; static short saved_state; -static DEFINE_SPINLOCK(leds_lock); +static DEFINE_RAW_SPINLOCK(leds_lock); short sequoia_read(int addr) { outw(addr,0x24); Index: linux/arch/arm/mm/consistent.c =================================================================== --- linux.orig/arch/arm/mm/consistent.c +++ linux/arch/arm/mm/consistent.c @@ -40,7 +40,7 @@ * These are the page tables (2MB each) covering uncached, DMA consistent allocations */ static pte_t *consistent_pte[NUM_CONSISTENT_PTES]; -static DEFINE_SPINLOCK(consistent_lock); +static DEFINE_RAW_SPINLOCK(consistent_lock); /* * VM region handling support. Index: linux/arch/arm/mm/copypage-v4mc.c =================================================================== --- linux.orig/arch/arm/mm/copypage-v4mc.c +++ linux/arch/arm/mm/copypage-v4mc.c @@ -29,7 +29,7 @@ #define minicache_pgprot __pgprot(L_PTE_PRESENT | L_PTE_YOUNG | \ L_PTE_CACHEABLE) -static DEFINE_SPINLOCK(minicache_lock); +static DEFINE_RAW_SPINLOCK(minicache_lock); /* * ARMv4 mini-dcache optimised copy_user_page @@ -43,7 +43,7 @@ static DEFINE_SPINLOCK(minicache_lock); * instruction. If your processor does not supply this, you have to write your * own copy_user_page that does the right thing. */ -static void __attribute__((naked)) +static void notrace __attribute__((naked)) mc_copy_user_page(void *from, void *to) { asm volatile( @@ -82,7 +82,7 @@ void v4_mc_copy_user_page(void *kto, con /* * ARMv4 optimised clear_user_page */ -void __attribute__((naked)) +void notrace __attribute__((naked)) v4_mc_clear_user_page(void *kaddr, unsigned long vaddr) { asm volatile( Index: linux/arch/arm/mm/copypage-v6.c =================================================================== --- linux.orig/arch/arm/mm/copypage-v6.c +++ linux/arch/arm/mm/copypage-v6.c @@ -26,7 +26,7 @@ #define from_address (0xffff8000) #define to_address (0xffffc000) -static DEFINE_SPINLOCK(v6_lock); +static DEFINE_RAW_SPINLOCK(v6_lock); /* * Copy the user page. No aliasing to deal with so we can just Index: linux/arch/arm/mm/copypage-xscale.c =================================================================== --- linux.orig/arch/arm/mm/copypage-xscale.c +++ linux/arch/arm/mm/copypage-xscale.c @@ -31,7 +31,7 @@ #define minicache_pgprot __pgprot(L_PTE_PRESENT | L_PTE_YOUNG | \ L_PTE_CACHEABLE) -static DEFINE_SPINLOCK(minicache_lock); +static DEFINE_RAW_SPINLOCK(minicache_lock); /* * XScale mini-dcache optimised copy_user_page @@ -41,7 +41,7 @@ static DEFINE_SPINLOCK(minicache_lock); * Dcache aliasing issue. The writes will be forwarded to the write buffer, * and merged as appropriate. */ -static void __attribute__((naked)) +static void notrace __attribute__((naked)) mc_copy_user_page(void *from, void *to) { /* @@ -104,7 +104,7 @@ void xscale_mc_copy_user_page(void *kto, /* * XScale optimised clear_user_page */ -void __attribute__((naked)) +void notrace __attribute__((naked)) xscale_mc_clear_user_page(void *kaddr, unsigned long vaddr) { asm volatile( Index: linux/arch/arm/mm/fault.c =================================================================== --- linux.orig/arch/arm/mm/fault.c +++ linux/arch/arm/mm/fault.c @@ -216,7 +216,7 @@ out: return fault; } -static int +static notrace int do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { struct task_struct *tsk; @@ -316,7 +316,7 @@ no_context: * interrupt or a critical region, and should only copy the information * from the master page table, nothing more. */ -static int +static notrace int do_translation_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { @@ -359,7 +359,7 @@ bad_area: * Some section permission faults need to be handled gracefully. * They can happen due to a __{get,put}_user during an oops. */ -static int +static notrace int do_sect_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { do_bad_area(addr, fsr, regs); @@ -369,7 +369,7 @@ do_sect_fault(unsigned long addr, unsign /* * This abort handler always returns "fault". */ -static int +static notrace int do_bad(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { return 1; @@ -424,7 +424,7 @@ static struct fsr_info { { do_bad, SIGBUS, 0, "unknown 31" } }; -void __init +void __init notrace hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *), int sig, const char *name) { @@ -438,7 +438,7 @@ hook_fault_code(int nr, int (*fn)(unsign /* * Dispatch a data abort to the relevant handler. */ -asmlinkage void +asmlinkage notrace void do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { const struct fsr_info *inf = fsr_info + (fsr & 15) + ((fsr & (1 << 10)) >> 6); @@ -457,7 +457,7 @@ do_DataAbort(unsigned long addr, unsigne notify_die("", regs, &info, fsr, 0); } -asmlinkage void +asmlinkage notrace void do_PrefetchAbort(unsigned long addr, struct pt_regs *regs) { do_translation_fault(addr, 0, regs); Index: linux/arch/arm/mm/mmu.c =================================================================== --- linux.orig/arch/arm/mm/mmu.c +++ linux/arch/arm/mm/mmu.c @@ -25,7 +25,7 @@ #include "mm.h" -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); extern void _stext, _etext, __data_start, _end; extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; Index: linux/arch/arm/oprofile/op_model_xscale.c =================================================================== --- linux.orig/arch/arm/oprofile/op_model_xscale.c +++ linux/arch/arm/oprofile/op_model_xscale.c @@ -381,8 +381,9 @@ static int xscale_pmu_start(void) { int ret; u32 pmnc = read_pmnc(); + int irq_flags = IRQF_DISABLED | IRQF_NODELAY; - ret = request_irq(XSCALE_PMU_IRQ, xscale_pmu_interrupt, IRQF_DISABLED, + ret = request_irq(XSCALE_PMU_IRQ, xscale_pmu_interrupt, irq_flags, "XScale PMU", (void *)results); if (ret < 0) { Index: linux/arch/arm/plat-omap/clock.c =================================================================== --- linux.orig/arch/arm/plat-omap/clock.c +++ linux/arch/arm/plat-omap/clock.c @@ -29,7 +29,7 @@ static LIST_HEAD(clocks); static DEFINE_MUTEX(clocks_mutex); -static DEFINE_SPINLOCK(clockfw_lock); +static DEFINE_RAW_SPINLOCK(clockfw_lock); static struct clk_functions *arch_clock; Index: linux/arch/arm/plat-omap/dma.c =================================================================== --- linux.orig/arch/arm/plat-omap/dma.c +++ linux/arch/arm/plat-omap/dma.c @@ -990,7 +990,7 @@ static struct irqaction omap24xx_dma_irq /*----------------------------------------------------------------------------*/ static struct lcd_dma_info { - spinlock_t lock; + raw_spinlock_t lock; int reserved; void (* callback)(u16 status, void *data); void *cb_data; Index: linux/arch/arm/plat-omap/gpio.c =================================================================== --- linux.orig/arch/arm/plat-omap/gpio.c +++ linux/arch/arm/plat-omap/gpio.c @@ -120,7 +120,7 @@ struct gpio_bank { u32 reserved_map; u32 suspend_wakeup; u32 saved_wakeup; - spinlock_t lock; + raw_spinlock_t lock; }; #define METHOD_MPUIO 0 Index: linux/arch/arm/plat-omap/mux.c =================================================================== --- linux.orig/arch/arm/plat-omap/mux.c +++ linux/arch/arm/plat-omap/mux.c @@ -56,7 +56,7 @@ int __init omap_mux_register(struct pin_ */ int __init_or_module omap_cfg_reg(const unsigned long index) { - static DEFINE_SPINLOCK(mux_spin_lock); + static DEFINE_RAW_SPINLOCK(mux_spin_lock); unsigned long flags; struct pin_config *cfg; Index: linux/arch/i386/Kconfig =================================================================== --- linux.orig/arch/i386/Kconfig +++ linux/arch/i386/Kconfig @@ -18,6 +18,14 @@ config GENERIC_TIME bool default y +config GENERIC_CLOCKEVENTS + bool + default y + +config GENERIC_CLOCKEVENTS_BROADCAST + bool + default y + config LOCKDEP_SUPPORT bool default y @@ -65,6 +73,8 @@ source "init/Kconfig" menu "Processor type and features" +source "kernel/time/Kconfig" + config SMP bool "Symmetric multi-processing support" ---help--- @@ -260,6 +270,19 @@ config SCHED_MC source "kernel/Kconfig.preempt" +config RWSEM_GENERIC_SPINLOCK + bool + depends on M386 || PREEMPT_RT + default y + +config ASM_SEMAPHORES + bool + default y + +config RWSEM_XCHGADD_ALGORITHM + bool + default y if !RWSEM_GENERIC_SPINLOCK + config X86_UP_APIC bool "Local APIC support on uniprocessors" depends on !SMP && !(X86_VISWS || X86_VOYAGER || X86_GENERICARCH) @@ -712,6 +735,7 @@ config BOOT_IOREMAP config REGPARM bool "Use register arguments" + depends on !MCOUNT default y help Compile the kernel with -mregparm=3. This instructs gcc to use @@ -814,6 +838,10 @@ config COMPAT_VDSO If unsure, say Y. +config GENERIC_TIME_VSYSCALL + depends on EXPERIMENTAL + bool "VSYSCALL gettimeofday() interface" + endmenu config ARCH_ENABLE_MEMORY_HOTPLUG Index: linux/arch/i386/Kconfig.cpu =================================================================== --- linux.orig/arch/i386/Kconfig.cpu +++ linux/arch/i386/Kconfig.cpu @@ -236,11 +236,6 @@ config RWSEM_GENERIC_SPINLOCK depends on M386 default y -config RWSEM_XCHGADD_ALGORITHM - bool - depends on !M386 - default y - config GENERIC_CALIBRATE_DELAY bool default y Index: linux/arch/i386/Kconfig.debug =================================================================== --- linux.orig/arch/i386/Kconfig.debug +++ linux/arch/i386/Kconfig.debug @@ -22,6 +22,7 @@ config EARLY_PRINTK config DEBUG_STACKOVERFLOW bool "Check for stack overflows" depends on DEBUG_KERNEL + default y help This option will cause messages to be printed if free stack space drops below a certain limit. @@ -29,6 +30,7 @@ config DEBUG_STACKOVERFLOW config DEBUG_STACK_USAGE bool "Stack utilization instrumentation" depends on DEBUG_KERNEL + default y help Enables the display of the minimum amount of free stack which each task has ever had available in the sysrq-T and sysrq-P debug output. @@ -49,6 +51,7 @@ config DEBUG_PAGEALLOC config DEBUG_RODATA bool "Write protect kernel read-only data structures" depends on DEBUG_KERNEL + default y help Mark the kernel read-only data as write-protected in the pagetables, in order to catch accidental (and incorrect) writes to such const @@ -59,6 +62,7 @@ config DEBUG_RODATA config 4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL + default y help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates Index: linux/arch/i386/boot/compressed/misc.c =================================================================== --- linux.orig/arch/i386/boot/compressed/misc.c +++ linux/arch/i386/boot/compressed/misc.c @@ -15,6 +15,12 @@ #include #include +#ifdef CONFIG_MCOUNT +void notrace mcount(void) +{ +} +#endif + /* * gzip declarations */ @@ -107,7 +113,7 @@ static long free_mem_end_ptr; #define INPLACE_MOVE_ROUTINE 0x1000 #define LOW_BUFFER_START 0x2000 #define LOW_BUFFER_MAX 0x90000 -#define HEAP_SIZE 0x3000 +#define HEAP_SIZE 0x4000 static unsigned int low_buffer_end, low_buffer_size; static int high_loaded =0; static uch *high_buffer_start /* = (uch *)(((ulg)&end) + HEAP_SIZE)*/; Index: linux/arch/i386/kernel/Makefile =================================================================== --- linux.orig/arch/i386/kernel/Makefile +++ linux/arch/i386/kernel/Makefile @@ -12,14 +12,16 @@ obj-y := process.o signal.o entry.o trap obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += cpu/ obj-y += acpi/ +obj-$(CONFIG_GENERIC_TIME_VSYSCALL) += vsyscall-gtod.o obj-$(CONFIG_X86_BIOS_REBOOT) += reboot.o obj-$(CONFIG_MCA) += mca.o obj-$(CONFIG_X86_MSR) += msr.o obj-$(CONFIG_X86_CPUID) += cpuid.o obj-$(CONFIG_MICROCODE) += microcode.o obj-$(CONFIG_APM) += apm.o -obj-$(CONFIG_X86_SMP) += smp.o smpboot.o +obj-$(CONFIG_X86_SMP) += smp.o smpboot.o tsc_sync.o obj-$(CONFIG_X86_TRAMPOLINE) += trampoline.o +obj-$(CONFIG_MCOUNT) += mcount-wrapper.o obj-$(CONFIG_X86_MPPARSE) += mpparse.o obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o obj-$(CONFIG_X86_IO_APIC) += io_apic.o Index: linux/arch/i386/kernel/acpi/boot.c =================================================================== --- linux.orig/arch/i386/kernel/acpi/boot.c +++ linux/arch/i386/kernel/acpi/boot.c @@ -25,6 +25,7 @@ #include #include +#include #include #include #include @@ -638,6 +639,7 @@ static int __init acpi_parse_sbf(unsigne } #ifdef CONFIG_HPET_TIMER +#include static int __init acpi_parse_hpet(unsigned long phys, unsigned long size) { @@ -672,24 +674,16 @@ static int __init acpi_parse_hpet(unsign } #ifdef CONFIG_X86_64 - vxtime.hpet_address = hpet_tbl->addr.addrl | + hpet_address = hpet_tbl->addr.addrl | ((long)hpet_tbl->addr.addrh << 32); - printk(KERN_INFO PREFIX "HPET id: %#x base: %#lx\n", - hpet_tbl->id, vxtime.hpet_address); - - res_start = vxtime.hpet_address; #else /* X86 */ - { - extern unsigned long hpet_address; + hpet_address = hpet_tbl->addr.addrl; - hpet_address = hpet_tbl->addr.addrl; - printk(KERN_INFO PREFIX "HPET id: %#x base: %#lx\n", - hpet_tbl->id, hpet_address); - - res_start = hpet_address; - } #endif /* X86 */ + printk(KERN_INFO PREFIX "HPET id: %#x base: %#lx\n", + hpet_tbl->id, hpet_address); + res_start = hpet_address; if (hpet_res) { hpet_res->start = res_start; @@ -703,10 +697,6 @@ static int __init acpi_parse_hpet(unsign #define acpi_parse_hpet NULL #endif -#ifdef CONFIG_X86_PM_TIMER -extern u32 pmtmr_ioport; -#endif - static int __init acpi_parse_fadt(unsigned long phys, unsigned long size) { struct fadt_descriptor *fadt = NULL; @@ -931,7 +921,7 @@ static void __init acpi_process_madt(voi acpi_ioapic = 1; smp_found_config = 1; - clustered_apic_check(); + setup_apic_routing(); } } if (error == -EINVAL) { Index: linux/arch/i386/kernel/apic.c =================================================================== --- linux.orig/arch/i386/kernel/apic.c +++ linux/arch/i386/kernel/apic.c @@ -25,6 +25,8 @@ #include #include #include +#include +#include #include #include @@ -44,128 +46,603 @@ #include "io_ports.h" /* - * cpu_mask that denotes the CPUs that needs timer interrupt coming in as - * IPIs in place of local APIC timers + * Sanity check */ -static cpumask_t timer_bcast_ipi; +#if (SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F +# error SPURIOUS_APIC_VECTOR definition error +#endif /* * Knob to control our willingness to enable the local APIC. + * + * -1=force-disable, +1=force-enable */ -static int enable_local_apic __initdata = 0; /* -1=force-disable, +1=force-enable */ - -static inline void lapic_disable(void) -{ - enable_local_apic = -1; - clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); -} +static int enable_local_apic __initdata = 0; -static inline void lapic_enable(void) -{ - enable_local_apic = 1; -} +/* Enable local APIC timer for highres/dyntick on UP */ +static int enable_local_apic_timer __initdata = 0; /* - * Debug level + * Debug level, exported for io_apic.c */ int apic_verbosity; +static unsigned int calibration_result; +static void lapic_next_event(unsigned long delta, + struct clock_event_device *evt); +static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt); +static void lapic_timer_broadcast(cpumask_t *mask); static void apic_pm_activate(void); +/* + * The local apic timer can be used for any function which is CPU local. + */ +static struct clock_event_device lapic_clockevent = { + .name = "lapic", + .capabilities = CLOCK_CAP_PROFILE +#ifdef CONFIG_SMP + /* + * On UP we keep update_process_times() on the PIT interrupt to + * resemble the original behaviour as close as possible. SMP + * requires to run this CPU local. + */ + | CLOCK_CAP_UPDATE +#endif + , + .shift = 32, + .set_mode = lapic_timer_setup, + .set_next_event = lapic_next_event, +}; +static DEFINE_PER_CPU(struct clock_event_device, lapic_events); + +/* Local APIC was disabled by the BIOS and enabled by the kernel */ +static int enabled_via_apicbase; + +/* + * Get the LAPIC version + */ +static inline int lapic_get_version(void) +{ + return GET_APIC_VERSION(apic_read(APIC_LVR)); +} + +/* + * Check, if the APIC is integrated or a seperate chip + */ +static inline int lapic_is_integrated(void) +{ + return APIC_INTEGRATED(lapic_get_version()); +} + +/* + * Check, whether this is a modern or a first generation APIC + */ static int modern_apic(void) { - unsigned int lvr, version; /* AMD systems use old APIC versions, so check the CPU */ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD && - boot_cpu_data.x86 >= 0xf) + boot_cpu_data.x86 >= 0xf) return 1; - lvr = apic_read(APIC_LVR); - version = GET_APIC_VERSION(lvr); - return version >= 0x14; + return lapic_get_version() >= 0x14; +} + +/** + * enable_NMI_through_LVT0 - enable NMI through local vector table 0 + */ +void enable_NMI_through_LVT0 (void * dummy) +{ + unsigned int v = APIC_DM_NMI; + + /* Level triggered for 82489DX */ + if (!lapic_is_integrated()) + v |= APIC_LVT_LEVEL_TRIGGER; + apic_write_around(APIC_LVT0, v); +} + +/** + * get_physical_broadcast - Get number of physical broadcast IDs + */ +int get_physical_broadcast(void) +{ + return modern_apic() ? 0xff : 0xf; +} + +/** + * lapic_get_maxlvt - get the maximum number of local vector table entries + */ +int lapic_get_maxlvt(void) +{ + unsigned int v = apic_read(APIC_LVR); + + /* 82489DXs do not report # of LVT entries. */ + return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2; } /* - * 'what should we do if we get a hw irq event on an illegal vector'. - * each architecture has to answer this themselves. + * Local APIC timer + */ + +/* Clock divisor is set to 16 */ +#define APIC_DIVISOR 16 + +/* + * This function sets up the local APIC timer, with a timeout of + * 'clocks' APIC bus clock. During calibration we actually call + * this function twice on the boot CPU, once with a bogus timeout + * value, second time for real. The other (noncalibrating) CPUs + * call this function only once, with the real, calibrated value. + * + * We do reads before writes even if unnecessary, to get around the + * P5 APIC double write bug. */ -void ack_bad_irq(unsigned int irq) +static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen) { - printk("unexpected IRQ trap at vector %02x\n", irq); + unsigned int lvtt_value, tmp_value; + + lvtt_value = LOCAL_TIMER_VECTOR; + if (!oneshot) + lvtt_value |= APIC_LVT_TIMER_PERIODIC; + if (!lapic_is_integrated()) + lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV); + + if (!irqen) + lvtt_value |= APIC_LVT_MASKED; + + apic_write_around(APIC_LVTT, lvtt_value); + /* - * Currently unexpected vectors happen only on SMP and APIC. - * We _must_ ack these because every local APIC has only N - * irq slots per priority level, and a 'hanging, unacked' IRQ - * holds up an irq slot - in excessive cases (when multiple - * unexpected vectors occur) that might lock up the APIC - * completely. - * But only ack when the APIC is enabled -AK + * Divide PICLK by 16 */ - if (cpu_has_apic) - ack_APIC_irq(); + tmp_value = apic_read(APIC_TDCR); + apic_write_around(APIC_TDCR, (tmp_value + & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) + | APIC_TDR_DIV_16); + + if (!oneshot) + apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR); } -void __init apic_intr_init(void) +/* + * Program the next event, relative to now + */ +static void lapic_next_event(unsigned long delta, + struct clock_event_device *evt) { -#ifdef CONFIG_SMP - smp_intr_init(); -#endif - /* self generated IPI for local APIC timer */ - set_intr_gate(LOCAL_TIMER_VECTOR, apic_timer_interrupt); + apic_write_around(APIC_TMICT, delta); +} - /* IPI vectors for APIC spurious and error interrupts */ - set_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); - set_intr_gate(ERROR_APIC_VECTOR, error_interrupt); +/* + * Setup the lapic timer in periodic or oneshot mode + */ +static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt) +{ + unsigned long flags; + unsigned int v; - /* thermal monitor LVT interrupt */ -#ifdef CONFIG_X86_MCE_P4THERMAL - set_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt); -#endif + local_irq_save(flags); + + switch (mode) { + case CLOCK_EVT_PERIODIC: + case CLOCK_EVT_ONESHOT: + __setup_APIC_LVTT(calibration_result, + mode != CLOCK_EVT_PERIODIC, 1); + break; + case CLOCK_EVT_SHUTDOWN: + v = apic_read(APIC_LVTT); + v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR); + apic_write_around(APIC_LVTT, v); + break; + } + + local_irq_restore(flags); } -/* Using APIC to generate smp_local_timer_interrupt? */ -int using_apic_timer __read_mostly = 0; +/* + * Setup the local APIC timer for this CPU. Copy the initilized values + * of the boot CPU and register the clock event in the framework. + */ +static void __devinit setup_APIC_timer(void) +{ + struct clock_event_device *levt = &__get_cpu_var(lapic_events); -static int enabled_via_apicbase; + memcpy(levt, &lapic_clockevent, sizeof(*levt)); -void enable_NMI_through_LVT0 (void * dummy) + register_local_clockevent(levt); +} + +/* + * In this functions we calibrate APIC bus clocks to the external timer. + * + * We want to do the calibration only once since we want to have local timer + * irqs syncron. CPUs connected by the same APIC bus have the very same bus + * frequency. + * + * This was previously done by reading the PIT/HPET and waiting for a wrap + * around to find out, that a tick has elapsed. I have a box, where the PIT + * readout is broken, so it never gets out of the wait loop again. This was + * also reported by others. + * + * Monitoring the jiffies value is inaccurate and the clockevents + * infrastructure allows us to do a simple substitution of the interrupt + * handler. + * + * The calibration routine also uses the pm_timer when possible, as the PIT + * happens to run way too slow (factor 2.3 on my VAIO CoreDuo, which goes + * back to normal later in the boot process). + */ + +#define LAPIC_CAL_LOOPS (HZ/10) + +static __initdata volatile int lapic_cal_loops = -1; +static __initdata long lapic_cal_t1, lapic_cal_t2; +static __initdata unsigned long long lapic_cal_tsc1, lapic_cal_tsc2; +static __initdata unsigned long lapic_cal_pm1, lapic_cal_pm2; +static __initdata unsigned long lapic_cal_j1, lapic_cal_j2; + +/* + * Temporary interrupt handler. + */ +static void __init lapic_cal_handler(struct pt_regs *regs) { - unsigned int v, ver; + unsigned long long tsc = 0; + long tapic = apic_read(APIC_TMCCT); + unsigned long pm = acpi_pm_read_early(); - ver = apic_read(APIC_LVR); - ver = GET_APIC_VERSION(ver); - v = APIC_DM_NMI; /* unmask and set to NMI */ - if (!APIC_INTEGRATED(ver)) /* 82489DX */ - v |= APIC_LVT_LEVEL_TRIGGER; - apic_write_around(APIC_LVT0, v); + if (cpu_has_tsc) + rdtscll(tsc); + + switch (lapic_cal_loops++) { + case 0: + lapic_cal_t1 = tapic; + lapic_cal_tsc1 = tsc; + lapic_cal_pm1 = pm; + lapic_cal_j1 = jiffies; + break; + + case LAPIC_CAL_LOOPS: + lapic_cal_t2 = tapic; + lapic_cal_tsc2 = tsc; + if (pm < lapic_cal_pm1) + pm += ACPI_PM_OVRRUN; + lapic_cal_pm2 = pm; + lapic_cal_j2 = jiffies; + break; + } } -int get_physical_broadcast(void) +/* + * Setup the boot APIC + * + * Calibrate and verify the result. + */ +void __init setup_boot_APIC_clock(void) { - if (modern_apic()) - return 0xff; - else - return 0xf; + struct clock_event_device *levt = &__get_cpu_var(lapic_events); + const long pm_100ms = PMTMR_TICKS_PER_SEC/10; + const long pm_thresh = pm_100ms/100; + void (*real_handler)(struct pt_regs *regs); + unsigned long deltaj; + long delta, deltapm; + cpumask_t cpumask; + + apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n" + "calibrating APIC timer ...\n"); + + /* Register broadcast function */ + clockevents_register_broadcast(lapic_timer_broadcast); + + /* + * Enable the apic timer next event capability only for + * SMP and on UP, when requested via commandline + */ + if (num_possible_cpus() > 1 || enable_local_apic_timer) + lapic_clockevent.capabilities |= CLOCK_CAP_NEXTEVT; + + local_irq_disable(); + + /* Replace the global interrupt handler */ + real_handler = global_clock_event->event_handler; + global_clock_event->event_handler = lapic_cal_handler; + + /* + * Setup the APIC counter to 1e9. There is no way the lapic + * can underflow in the 100ms detection time frame + */ + __setup_APIC_LVTT(1000000000, 0, 0); + + /* Let the interrupts run */ + local_irq_enable(); + + while(lapic_cal_loops <= LAPIC_CAL_LOOPS); + + local_irq_disable(); + + /* Restore the real event handler */ + global_clock_event->event_handler = real_handler; + + /* Build delta t1-t2 as apic timer counts down */ + delta = lapic_cal_t1 - lapic_cal_t2; + apic_printk(APIC_VERBOSE, "... lapic delta = %ld\n", delta); + + /* Check, if the PM timer is available */ + deltapm = lapic_cal_pm2 - lapic_cal_pm1; + apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm); + + if (deltapm) { + unsigned long mult; + u64 res; + + mult = clocksource_hz2mult(PMTMR_TICKS_PER_SEC, 22); + + if (deltapm > (pm_100ms - pm_thresh) && + deltapm < (pm_100ms + pm_thresh)) { + apic_printk(APIC_VERBOSE, "... PM timer result ok\n"); + } else { + res = (((u64) deltapm) * mult) >> 22; + do_div(res, 1000000); + printk(KERN_WARNING "APIC calibration not consistent " + "with PM Timer: %ldms instead of 100ms\n", + (long)res); + /* Correct the lapic counter value */ + res = (((u64) delta ) * pm_100ms); + do_div(res, deltapm); + printk(KERN_INFO "APIC delta adjusted to PM-Timer: " + "%lu (%ld)\n", (unsigned long) res, delta); + delta = (long) res; + } + } + + /* Calculate the scaled math multiplication factor */ + lapic_clockevent.mult = div_sc(delta, TICK_NSEC * LAPIC_CAL_LOOPS, 32); + lapic_clockevent.max_delta_ns = + clockevent_delta2ns(0x7FFFFF, &lapic_clockevent); + lapic_clockevent.min_delta_ns = + clockevent_delta2ns(0xF, &lapic_clockevent); + + calibration_result = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS; + + apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta); + apic_printk(APIC_VERBOSE, "..... mult: %ld\n", lapic_clockevent.mult); + apic_printk(APIC_VERBOSE, "..... calibration result: %u\n", + calibration_result); + + if (cpu_has_tsc) { + delta = (long)(lapic_cal_tsc2 - lapic_cal_tsc1); + apic_printk(APIC_VERBOSE, "..... CPU clock speed is " + "%ld.%04ld MHz.\n", + (delta / LAPIC_CAL_LOOPS) / (1000000 / HZ), + (delta / LAPIC_CAL_LOOPS) % (1000000 / HZ)); + } + + apic_printk(APIC_VERBOSE, "..... host bus clock speed is " + "%u.%04u MHz.\n", + calibration_result / (1000000 / HZ), + calibration_result % (1000000 / HZ)); + + + apic_printk(APIC_VERBOSE, "... verify APIC timer\n"); + + /* + * Start LAPIC timer and verify that the calculated factor is correct + */ + setup_APIC_timer(); + + /* Replace the lapic interrupt handler */ + real_handler = levt->event_handler; + levt->event_handler = lapic_cal_handler; + lapic_cal_loops = -1; + + /* Let the interrupts run */ + local_irq_enable(); + + while(lapic_cal_loops <= LAPIC_CAL_LOOPS); + + local_irq_disable(); + + /* Restore the real event handler */ + levt->event_handler = real_handler; + + local_irq_enable(); + + /* Jiffies delta */ + deltaj = lapic_cal_j2 - lapic_cal_j1; + apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj); + + /* Check, if the PM timer is available */ + deltapm = lapic_cal_pm2 - lapic_cal_pm1; + apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm); + + if (deltapm) { + if (deltapm > (pm_100ms - pm_thresh) && + deltapm < (pm_100ms + pm_thresh)) { + apic_printk(APIC_VERBOSE, "... PM timer result ok\n"); + /* Check, if the jiffies result is consistent */ + if (deltaj < LAPIC_CAL_LOOPS-2 || + deltaj > LAPIC_CAL_LOOPS+2) { + /* + * Not sure, what we can do about this one. + * When high resultion timers are active + * and the lapic timer does not stop in C3 + * we are fine. Otherwise more trouble might + * be waiting. -- tglx + */ + printk(KERN_WARNING "Global event device %s " + "has wrong frequency " + "(%lu ticks instead of %d)\n", + global_clock_event->name, deltaj, + LAPIC_CAL_LOOPS); + } + return; + } + } else { + /* Check, if the jiffies result is consistent */ + if (deltaj >= LAPIC_CAL_LOOPS-2 && + deltaj <= LAPIC_CAL_LOOPS+2) { + apic_printk(APIC_VERBOSE, "... jiffies result ok\n"); + return; + } + } + + printk(KERN_WARNING + "APIC timer disabled due to verification failure.\n"); + local_irq_disable(); + cpumask = cpumask_of_cpu(smp_processor_id()); + switch_APIC_timer_to_ipi(&cpumask); + local_irq_enable(); } -int get_maxlvt(void) +void __devinit setup_secondary_APIC_clock(void) { - unsigned int v, ver, maxlvt; + setup_APIC_timer(); +} - v = apic_read(APIC_LVR); - ver = GET_APIC_VERSION(v); - /* 82489DXs do not report # of LVT entries. */ - maxlvt = APIC_INTEGRATED(ver) ? GET_APIC_MAXLVT(v) : 2; - return maxlvt; +void switch_APIC_timer_to_ipi(void *cpumask) +{ + struct clock_event_device *levt = &__get_cpu_var(lapic_events); + cpumask_t mask = *(cpumask_t *)cpumask; + int cpu = smp_processor_id(); + + if (cpu_isset(cpu, mask) && levt->event_handler) + clockevents_set_global_broadcast(levt, 1); +} +EXPORT_SYMBOL(switch_APIC_timer_to_ipi); + +void switch_ipi_to_APIC_timer(void *cpumask) +{ + struct clock_event_device *levt = &__get_cpu_var(lapic_events); + cpumask_t mask = *(cpumask_t *)cpumask; + int cpu = smp_processor_id(); + + if (cpu_isset(cpu, mask) && levt->event_handler) + clockevents_set_global_broadcast(levt, 0); } +EXPORT_SYMBOL(switch_ipi_to_APIC_timer); + +/* + * The guts of the apic timer interrupt + */ +static fastcall void local_apic_timer_interrupt(struct pt_regs *regs) +{ + int cpu = smp_processor_id(); + struct clock_event_device *evt = &per_cpu(lapic_events, cpu); + /* + * Normally we should not be here till LAPIC has been initialized but + * in some cases like kdump, its possible that there is a pending LAPIC + * timer interrupt from previous kernel's context and is delivered in + * new kernel the moment interrupts are enabled. + * + * Interrupts are enabled early and LAPIC is setup much later, hence + * its possible that when we get here evt->event_handler is NULL. + * Check for event_handler being NULL and discard the interrupt as + * spurious. + */ + if (!evt->event_handler) { + printk(KERN_WARNING + "Spurious LAPIC timer interrupt on cpu %d\n", cpu); + return; + } + + per_cpu(irq_stat, cpu).apic_timer_irqs++; + + /* + * If the task is currently running in user mode, don't + * detect soft lockups. If CONFIG_DETECT_SOFTLOCKUP is not + * configured, this should be optimized out. + */ + if (user_mode(regs)) + touch_softlockup_watchdog(); + + evt->event_handler(regs); +} + +/* + * Local APIC timer interrupt. This is the most natural way for doing + * local interrupts, but local timer interrupts can be emulated by + * broadcast interrupts too. [in case the hw doesn't support APIC timers] + * + * [ if a single-CPU system runs an SMP kernel then we call the local + * interrupt as well. Thus we cannot inline the local irq ... ] + */ + +fastcall notrace void smp_apic_timer_interrupt(struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + + trace_special(regs->eip, 1, 0); + + /* + * NOTE! We'd better ACK the irq immediately, + * because timer handling can be slow. + */ + ack_APIC_irq(); + /* + * update_process_times() expects us to have done irq_enter(). + * Besides, if we don't timer interrupts ignore the global + * interrupt lock, which is the WrongThing (tm) to do. + */ + irq_enter(); + local_apic_timer_interrupt(regs); + irq_exit(); + set_irq_regs(old_regs); +} + +/* + * Local APIC timer broadcast function + */ +static void lapic_timer_broadcast(cpumask_t *cpumask) +{ + int cpu = smp_processor_id(); + cpumask_t mask; + + cpus_and(mask, cpu_online_map, *cpumask); + if (cpu_isset(cpu, mask)) { + cpu_clear(cpu, mask); + local_apic_timer_interrupt(get_irq_regs()); + } +#ifdef CONFIG_SMP + if (!cpus_empty(mask)) + send_IPI_mask(mask, LOCAL_TIMER_VECTOR); +#endif +} + +/* + * Local APIC set next event broadcast + */ +void lapic_timer_idle_broadcast(int broadcast) +{ + int cpu = smp_processor_id(); + struct clock_event_device *evt = &per_cpu(lapic_events, cpu); + + if (evt->event_handler) + clockevents_set_broadcast(evt, broadcast); +} +EXPORT_SYMBOL_GPL(lapic_timer_idle_broadcast); + +int setup_profiling_timer(unsigned int multiplier) +{ + return -EINVAL; +} + +/* + * Local APIC start and shutdown + */ + +/** + * clear_local_APIC - shutdown the local APIC + * + * This is called, when a CPU is disabled and before rebooting, so the state of + * the local APIC has no dangling leftovers. Also used to cleanout any BIOS + * leftovers during boot. + */ void clear_local_APIC(void) { - int maxlvt; + int maxlvt = lapic_get_maxlvt(); unsigned long v; - maxlvt = get_maxlvt(); - /* * Masking an LVT entry can trigger a local APIC error * if the vector is zero. Mask LVTERR first to prevent this. @@ -189,7 +666,7 @@ void clear_local_APIC(void) apic_write_around(APIC_LVTPC, v | APIC_LVT_MASKED); } -/* lets not touch this if we didn't frob it */ + /* lets not touch this if we didn't frob it */ #ifdef CONFIG_X86_MCE_P4THERMAL if (maxlvt >= 5) { v = apic_read(APIC_LVTTHMR); @@ -211,85 +688,18 @@ void clear_local_APIC(void) if (maxlvt >= 5) apic_write_around(APIC_LVTTHMR, APIC_LVT_MASKED); #endif - v = GET_APIC_VERSION(apic_read(APIC_LVR)); - if (APIC_INTEGRATED(v)) { /* !82489DX */ - if (maxlvt > 3) /* Due to Pentium errata 3AP and 11AP. */ + /* Integrated APIC (!82489DX) ? */ + if (lapic_is_integrated()) { + if (maxlvt > 3) + /* Clear ESR due to Pentium errata 3AP and 11AP */ apic_write(APIC_ESR, 0); apic_read(APIC_ESR); } } -void __init connect_bsp_APIC(void) -{ - if (pic_mode) { - /* - * Do not trust the local APIC being empty at bootup. - */ - clear_local_APIC(); - /* - * PIC mode, enable APIC mode in the IMCR, i.e. - * connect BSP's local APIC to INT and NMI lines. - */ - apic_printk(APIC_VERBOSE, "leaving PIC mode, " - "enabling APIC mode.\n"); - outb(0x70, 0x22); - outb(0x01, 0x23); - } - enable_apic_mode(); -} - -void disconnect_bsp_APIC(int virt_wire_setup) -{ - if (pic_mode) { - /* - * Put the board back into PIC mode (has an effect - * only on certain older boards). Note that APIC - * interrupts, including IPIs, won't work beyond - * this point! The only exception are INIT IPIs. - */ - apic_printk(APIC_VERBOSE, "disabling APIC mode, " - "entering PIC mode.\n"); - outb(0x70, 0x22); - outb(0x00, 0x23); - } - else { - /* Go back to Virtual Wire compatibility mode */ - unsigned long value; - - /* For the spurious interrupt use vector F, and enable it */ - value = apic_read(APIC_SPIV); - value &= ~APIC_VECTOR_MASK; - value |= APIC_SPIV_APIC_ENABLED; - value |= 0xf; - apic_write_around(APIC_SPIV, value); - - if (!virt_wire_setup) { - /* For LVT0 make it edge triggered, active high, external and enabled */ - value = apic_read(APIC_LVT0); - value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | - APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | - APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED ); - value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; - value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); - apic_write_around(APIC_LVT0, value); - } - else { - /* Disable LVT0 */ - apic_write_around(APIC_LVT0, APIC_LVT_MASKED); - } - - /* For LVT1 make it edge triggered, active high, nmi and enabled */ - value = apic_read(APIC_LVT1); - value &= ~( - APIC_MODE_MASK | APIC_SEND_PENDING | - APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | - APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); - value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; - value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI); - apic_write_around(APIC_LVT1, value); - } -} - +/** + * disable_local_APIC - clear and disable the local APIC + */ void disable_local_APIC(void) { unsigned long value; @@ -304,8 +714,13 @@ void disable_local_APIC(void) value &= ~APIC_SPIV_APIC_ENABLED; apic_write_around(APIC_SPIV, value); + /* + * When LAPIC was disabled by the BIOS and enabled by the kernel, + * restore the disabled state. + */ if (enabled_via_apicbase) { unsigned int l, h; + rdmsr(MSR_IA32_APICBASE, l, h); l &= ~MSR_IA32_APICBASE_ENABLE; wrmsr(MSR_IA32_APICBASE, l, h); @@ -313,6 +728,28 @@ void disable_local_APIC(void) } /* + * If Linux enabled the LAPIC against the BIOS default disable it down before + * re-entering the BIOS on shutdown. Otherwise the BIOS may get confused and + * not power-off. Additionally clear all LVT entries before disable_local_APIC + * for the case where Linux didn't enable the LAPIC. + */ +void lapic_shutdown(void) +{ + unsigned long flags; + + if (!cpu_has_apic) + return; + + local_irq_save(flags); + clear_local_APIC(); + + if (enabled_via_apicbase) + disable_local_APIC(); + + local_irq_restore(flags); +} + +/* * This is to verify that we're looking at a real local APIC. * Check these against your board if the CPUs aren't getting * started for no apparent reason. @@ -344,7 +781,7 @@ int __init verify_local_APIC(void) reg1 = GET_APIC_VERSION(reg0); if (reg1 == 0x00 || reg1 == 0xff) return 0; - reg1 = get_maxlvt(); + reg1 = lapic_get_maxlvt(); if (reg1 < 0x02 || reg1 == 0xff) return 0; @@ -367,10 +804,15 @@ int __init verify_local_APIC(void) return 1; } +/** + * sync_Arb_IDs - synchronize APIC bus arbitration IDs + */ void __init sync_Arb_IDs(void) { - /* Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1 - And not needed on AMD */ + /* + * Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1 And not + * needed on AMD. + */ if (modern_apic()) return; /* @@ -383,14 +825,12 @@ void __init sync_Arb_IDs(void) | APIC_DM_INIT); } -extern void __error_in_apic_c (void); - /* * An initial setup of the virtual wire mode. */ void __init init_bsp_APIC(void) { - unsigned long value, ver; + unsigned long value; /* * Don't do the setup now if we have a SMP BIOS as the @@ -399,9 +839,6 @@ void __init init_bsp_APIC(void) if (smp_found_config || !cpu_has_apic) return; - value = apic_read(APIC_LVR); - ver = GET_APIC_VERSION(value); - /* * Do not trust the local APIC being empty at bootup. */ @@ -413,9 +850,10 @@ void __init init_bsp_APIC(void) value = apic_read(APIC_SPIV); value &= ~APIC_VECTOR_MASK; value |= APIC_SPIV_APIC_ENABLED; - + /* This bit is reserved on P4/Xeon and should be cleared */ - if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 15)) + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && + (boot_cpu_data.x86 == 15)) value &= ~APIC_SPIV_FOCUS_DISABLED; else value |= APIC_SPIV_FOCUS_DISABLED; @@ -427,14 +865,17 @@ void __init init_bsp_APIC(void) */ apic_write_around(APIC_LVT0, APIC_DM_EXTINT); value = APIC_DM_NMI; - if (!APIC_INTEGRATED(ver)) /* 82489DX */ + if (!lapic_is_integrated()) /* 82489DX */ value |= APIC_LVT_LEVEL_TRIGGER; apic_write_around(APIC_LVT1, value); } +/** + * setup_local_APIC - setup the local APIC + */ void __devinit setup_local_APIC(void) { - unsigned long oldvalue, value, ver, maxlvt; + unsigned long oldvalue, value, maxlvt, integrated; int i, j; /* Pound the ESR really hard over the head with a big hammer - mbligh */ @@ -445,11 +886,7 @@ void __devinit setup_local_APIC(void) apic_write(APIC_ESR, 0); } - value = apic_read(APIC_LVR); - ver = GET_APIC_VERSION(value); - - if ((SPURIOUS_APIC_VECTOR & 0x0f) != 0x0f) - __error_in_apic_c(); + integrated = lapic_is_integrated(); /* * Double-check whether this APIC is really registered. @@ -520,13 +957,10 @@ void __devinit setup_local_APIC(void) * like LRU than MRU (the short-term load is more even across CPUs). * See also the comment in end_level_ioapic_irq(). --macro */ -#if 1 + /* Enable focus processor (bit==0) */ value &= ~APIC_SPIV_FOCUS_DISABLED; -#else - /* Disable focus processor (bit==1) */ - value |= APIC_SPIV_FOCUS_DISABLED; -#endif + /* * Set spurious IRQ vector */ @@ -562,17 +996,18 @@ void __devinit setup_local_APIC(void) value = APIC_DM_NMI; else value = APIC_DM_NMI | APIC_LVT_MASKED; - if (!APIC_INTEGRATED(ver)) /* 82489DX */ + if (!integrated) /* 82489DX */ value |= APIC_LVT_LEVEL_TRIGGER; apic_write_around(APIC_LVT1, value); - if (APIC_INTEGRATED(ver) && !esr_disable) { /* !82489DX */ - maxlvt = get_maxlvt(); + if (integrated && !esr_disable) { /* !82489DX */ + maxlvt = lapic_get_maxlvt(); if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */ apic_write(APIC_ESR, 0); oldvalue = apic_read(APIC_ESR); - value = ERROR_APIC_VECTOR; // enables sending errors + /* enables sending errors */ + value = ERROR_APIC_VECTOR; apic_write_around(APIC_LVTERR, value); /* * spec says clear errors after enabling vector. @@ -585,193 +1020,30 @@ void __devinit setup_local_APIC(void) "vector: 0x%08lx after: 0x%08lx\n", oldvalue, value); } else { - if (esr_disable) - /* - * Something untraceble is creating bad interrupts on + if (esr_disable) + /* + * Something untraceble is creating bad interrupts on * secondary quads ... for the moment, just leave the * ESR disabled - we can't do anything useful with the - * errors anyway - mbligh - */ - printk("Leaving ESR disabled.\n"); - else - printk("No ESR for 82489DX.\n"); - } - - setup_apic_nmi_watchdog(NULL); - apic_pm_activate(); -} - -/* - * If Linux enabled the LAPIC against the BIOS default - * disable it down before re-entering the BIOS on shutdown. - * Otherwise the BIOS may get confused and not power-off. - * Additionally clear all LVT entries before disable_local_APIC - * for the case where Linux didn't enable the LAPIC. - */ -void lapic_shutdown(void) -{ - unsigned long flags; - - if (!cpu_has_apic) - return; - - local_irq_save(flags); - clear_local_APIC(); - - if (enabled_via_apicbase) - disable_local_APIC(); - - local_irq_restore(flags); -} - -#ifdef CONFIG_PM - -static struct { - int active; - /* r/w apic fields */ - unsigned int apic_id; - unsigned int apic_taskpri; - unsigned int apic_ldr; - unsigned int apic_dfr; - unsigned int apic_spiv; - unsigned int apic_lvtt; - unsigned int apic_lvtpc; - unsigned int apic_lvt0; - unsigned int apic_lvt1; - unsigned int apic_lvterr; - unsigned int apic_tmict; - unsigned int apic_tdcr; - unsigned int apic_thmr; -} apic_pm_state; - -static int lapic_suspend(struct sys_device *dev, pm_message_t state) -{ - unsigned long flags; - - if (!apic_pm_state.active) - return 0; - - apic_pm_state.apic_id = apic_read(APIC_ID); - apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI); - apic_pm_state.apic_ldr = apic_read(APIC_LDR); - apic_pm_state.apic_dfr = apic_read(APIC_DFR); - apic_pm_state.apic_spiv = apic_read(APIC_SPIV); - apic_pm_state.apic_lvtt = apic_read(APIC_LVTT); - apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC); - apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); - apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); - apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); - apic_pm_state.apic_tmict = apic_read(APIC_TMICT); - apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); - apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); - - local_irq_save(flags); - disable_local_APIC(); - local_irq_restore(flags); - return 0; -} - -static int lapic_resume(struct sys_device *dev) -{ - unsigned int l, h; - unsigned long flags; - - if (!apic_pm_state.active) - return 0; - - local_irq_save(flags); - - /* - * Make sure the APICBASE points to the right address - * - * FIXME! This will be wrong if we ever support suspend on - * SMP! We'll need to do this as part of the CPU restore! - */ - rdmsr(MSR_IA32_APICBASE, l, h); - l &= ~MSR_IA32_APICBASE_BASE; - l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr; - wrmsr(MSR_IA32_APICBASE, l, h); - - apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED); - apic_write(APIC_ID, apic_pm_state.apic_id); - apic_write(APIC_DFR, apic_pm_state.apic_dfr); - apic_write(APIC_LDR, apic_pm_state.apic_ldr); - apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri); - apic_write(APIC_SPIV, apic_pm_state.apic_spiv); - apic_write(APIC_LVT0, apic_pm_state.apic_lvt0); - apic_write(APIC_LVT1, apic_pm_state.apic_lvt1); - apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr); - apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); - apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); - apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); - apic_write(APIC_TMICT, apic_pm_state.apic_tmict); - apic_write(APIC_ESR, 0); - apic_read(APIC_ESR); - apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); - apic_write(APIC_ESR, 0); - apic_read(APIC_ESR); - local_irq_restore(flags); - return 0; -} - -/* - * This device has no shutdown method - fully functioning local APICs - * are needed on every CPU up until machine_halt/restart/poweroff. - */ - -static struct sysdev_class lapic_sysclass = { - set_kset_name("lapic"), - .resume = lapic_resume, - .suspend = lapic_suspend, -}; - -static struct sys_device device_lapic = { - .id = 0, - .cls = &lapic_sysclass, -}; - -static void __devinit apic_pm_activate(void) -{ - apic_pm_state.active = 1; -} - -static int __init init_lapic_sysfs(void) -{ - int error; - - if (!cpu_has_apic) - return 0; - /* XXX: remove suspend/resume procs if !apic_pm_state.active? */ - - error = sysdev_class_register(&lapic_sysclass); - if (!error) - error = sysdev_register(&device_lapic); - return error; -} -device_initcall(init_lapic_sysfs); - -#else /* CONFIG_PM */ + * errors anyway - mbligh + */ + printk(KERN_INFO "Leaving ESR disabled.\n"); + else + printk(KERN_INFO "No ESR for 82489DX.\n"); + } -static void apic_pm_activate(void) { } + /* Disable the local apic timer */ + value = apic_read(APIC_LVTT); + value |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR); + apic_write_around(APIC_LVTT, value); -#endif /* CONFIG_PM */ + setup_apic_nmi_watchdog(NULL); + apic_pm_activate(); +} /* - * Detect and enable local APICs on non-SMP boards. - * Original code written by Keir Fraser. + * Detect and initialize APIC */ - -static int __init apic_set_verbosity(char *str) -{ - if (strcmp("debug", str) == 0) - apic_verbosity = APIC_DEBUG; - else if (strcmp("verbose", str) == 0) - apic_verbosity = APIC_VERBOSE; - return 1; -} - -__setup("apic=", apic_set_verbosity); - static int __init detect_init_APIC (void) { u32 h, l, features; @@ -783,7 +1055,7 @@ static int __init detect_init_APIC (void switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_AMD: if ((boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model > 1) || - (boot_cpu_data.x86 == 15)) + (boot_cpu_data.x86 == 15)) break; goto no_apic; case X86_VENDOR_INTEL: @@ -797,23 +1069,23 @@ static int __init detect_init_APIC (void if (!cpu_has_apic) { /* - * Over-ride BIOS and try to enable the local - * APIC only if "lapic" specified. + * Over-ride BIOS and try to enable the local APIC only if + * "lapic" specified. */ if (enable_local_apic <= 0) { - printk("Local APIC disabled by BIOS -- " + printk(KERN_INFO "Local APIC disabled by BIOS -- " "you can enable it with \"lapic\"\n"); return -1; } /* - * Some BIOSes disable the local APIC in the - * APIC_BASE MSR. This can only be done in - * software for Intel P6 or later and AMD K7 - * (Model > 1) or later. + * Some BIOSes disable the local APIC in the APIC_BASE + * MSR. This can only be done in software for Intel P6 or later + * and AMD K7 (Model > 1) or later. */ rdmsr(MSR_IA32_APICBASE, l, h); if (!(l & MSR_IA32_APICBASE_ENABLE)) { - printk("Local APIC disabled by BIOS -- reenabling.\n"); + printk(KERN_INFO + "Local APIC disabled by BIOS -- reenabling.\n"); l &= ~MSR_IA32_APICBASE_BASE; l |= MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE; wrmsr(MSR_IA32_APICBASE, l, h); @@ -826,7 +1098,7 @@ static int __init detect_init_APIC (void */ features = cpuid_edx(1); if (!(features & (1 << X86_FEATURE_APIC))) { - printk("Could not enable APIC!\n"); + printk(KERN_WARNING "Could not enable APIC!\n"); return -1; } set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); @@ -840,17 +1112,20 @@ static int __init detect_init_APIC (void if (nmi_watchdog != NMI_NONE) nmi_watchdog = NMI_LOCAL_APIC; - printk("Found and enabled local APIC!\n"); + printk(KERN_INFO "Found and enabled local APIC!\n"); apic_pm_activate(); return 0; no_apic: - printk("No local APIC present or hardware disabled\n"); + printk(KERN_INFO "No local APIC present or hardware disabled\n"); return -1; } +/** + * init_apic_mappings - initialize APIC mappings + */ void __init init_apic_mappings(void) { unsigned long apic_phys; @@ -910,380 +1185,94 @@ fake_ioapic_page: } /* - * This part sets up the APIC 32 bit clock in LVTT1, with HZ interrupts - * per second. We assume that the caller has already set up the local - * APIC. - * - * The APIC timer is not exactly sync with the external timer chip, it - * closely follows bus clocks. - */ - -/* - * The timer chip is already set up at HZ interrupts per second here, - * but we do not accept timer interrupts yet. We only allow the BP - * to calibrate. - */ -static unsigned int __devinit get_8254_timer_count(void) -{ - unsigned long flags; - - unsigned int count; - - spin_lock_irqsave(&i8253_lock, flags); - - outb_p(0x00, PIT_MODE); - count = inb_p(PIT_CH0); - count |= inb_p(PIT_CH0) << 8; - - spin_unlock_irqrestore(&i8253_lock, flags); - - return count; -} - -/* next tick in 8254 can be caught by catching timer wraparound */ -static void __devinit wait_8254_wraparound(void) -{ - unsigned int curr_count, prev_count; - - curr_count = get_8254_timer_count(); - do { - prev_count = curr_count; - curr_count = get_8254_timer_count(); - - /* workaround for broken Mercury/Neptune */ - if (prev_count >= curr_count + 0x100) - curr_count = get_8254_timer_count(); - - } while (prev_count >= curr_count); -} - -/* - * Default initialization for 8254 timers. If we use other timers like HPET, - * we override this later - */ -void (*wait_timer_tick)(void) __devinitdata = wait_8254_wraparound; - -/* - * This function sets up the local APIC timer, with a timeout of - * 'clocks' APIC bus clock. During calibration we actually call - * this function twice on the boot CPU, once with a bogus timeout - * value, second time for real. The other (noncalibrating) CPUs - * call this function only once, with the real, calibrated value. - * - * We do reads before writes even if unnecessary, to get around the - * P5 APIC double write bug. - */ - -#define APIC_DIVISOR 16 - -static void __setup_APIC_LVTT(unsigned int clocks) -{ - unsigned int lvtt_value, tmp_value, ver; - int cpu = smp_processor_id(); - - ver = GET_APIC_VERSION(apic_read(APIC_LVR)); - lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR; - if (!APIC_INTEGRATED(ver)) - lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV); - - if (cpu_isset(cpu, timer_bcast_ipi)) - lvtt_value |= APIC_LVT_MASKED; - - apic_write_around(APIC_LVTT, lvtt_value); - - /* - * Divide PICLK by 16 - */ - tmp_value = apic_read(APIC_TDCR); - apic_write_around(APIC_TDCR, (tmp_value - & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) - | APIC_TDR_DIV_16); - - apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR); -} - -static void __devinit setup_APIC_timer(unsigned int clocks) -{ - unsigned long flags; - - local_irq_save(flags); - - /* - * Wait for IRQ0's slice: - */ - wait_timer_tick(); - - __setup_APIC_LVTT(clocks); - - local_irq_restore(flags); -} - -/* - * In this function we calibrate APIC bus clocks to the external - * timer. Unfortunately we cannot use jiffies and the timer irq - * to calibrate, since some later bootup code depends on getting - * the first irq? Ugh. - * - * We want to do the calibration only once since we - * want to have local timer irqs syncron. CPUs connected - * by the same APIC bus have the very same bus frequency. - * And we want to have irqs off anyways, no accidental - * APIC irq that way. + * This initializes the IO-APIC and APIC hardware if this is + * a UP kernel. */ - -static int __init calibrate_APIC_clock(void) +int __init APIC_init_uniprocessor (void) { - unsigned long long t1 = 0, t2 = 0; - long tt1, tt2; - long result; - int i; - const int LOOPS = HZ/10; - - apic_printk(APIC_VERBOSE, "calibrating APIC timer ...\n"); - - /* - * Put whatever arbitrary (but long enough) timeout - * value into the APIC clock, we just want to get the - * counter running for calibration. - */ - __setup_APIC_LVTT(1000000000); - - /* - * The timer chip counts down to zero. Let's wait - * for a wraparound to start exact measurement: - * (the current tick might have been already half done) - */ - - wait_timer_tick(); - - /* - * We wrapped around just now. Let's start: - */ - if (cpu_has_tsc) - rdtscll(t1); - tt1 = apic_read(APIC_TMCCT); - - /* - * Let's wait LOOPS wraprounds: - */ - for (i = 0; i < LOOPS; i++) - wait_timer_tick(); + if (enable_local_apic < 0) + clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); - tt2 = apic_read(APIC_TMCCT); - if (cpu_has_tsc) - rdtscll(t2); + if (!smp_found_config && !cpu_has_apic) + return -1; /* - * The APIC bus clock counter is 32 bits only, it - * might have overflown, but note that we use signed - * longs, thus no extra care needed. - * - * underflown to be exact, as the timer counts down ;) + * Complain if the BIOS pretends there is one. */ + if (!cpu_has_apic && + APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) { + printk(KERN_ERR "BIOS bug, local APIC #%d not detected!...\n", + boot_cpu_physical_apicid); + clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); + return -1; + } - result = (tt1-tt2)*APIC_DIVISOR/LOOPS; - - if (cpu_has_tsc) - apic_printk(APIC_VERBOSE, "..... CPU clock speed is " - "%ld.%04ld MHz.\n", - ((long)(t2-t1)/LOOPS)/(1000000/HZ), - ((long)(t2-t1)/LOOPS)%(1000000/HZ)); - - apic_printk(APIC_VERBOSE, "..... host bus clock speed is " - "%ld.%04ld MHz.\n", - result/(1000000/HZ), - result%(1000000/HZ)); - - return result; -} - -static unsigned int calibration_result; - -void __init setup_boot_APIC_clock(void) -{ - unsigned long flags; - apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"); - using_apic_timer = 1; + verify_local_APIC(); - local_irq_save(flags); + connect_bsp_APIC(); - calibration_result = calibrate_APIC_clock(); /* - * Now set up the timer for real. + * Hack: In case of kdump, after a crash, kernel might be booting + * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid + * might be zero if read from MP tables. Get it from LAPIC. */ - setup_APIC_timer(calibration_result); - - local_irq_restore(flags); -} - -void __devinit setup_secondary_APIC_clock(void) -{ - setup_APIC_timer(calibration_result); -} - -void disable_APIC_timer(void) -{ - if (using_apic_timer) { - unsigned long v; - - v = apic_read(APIC_LVTT); - /* - * When an illegal vector value (0-15) is written to an LVT - * entry and delivery mode is Fixed, the APIC may signal an - * illegal vector error, with out regard to whether the mask - * bit is set or whether an interrupt is actually seen on input. - * - * Boot sequence might call this function when the LVTT has - * '0' vector value. So make sure vector field is set to - * valid value. - */ - v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR); - apic_write_around(APIC_LVTT, v); - } -} - -void enable_APIC_timer(void) -{ - int cpu = smp_processor_id(); - - if (using_apic_timer && - !cpu_isset(cpu, timer_bcast_ipi)) { - unsigned long v; - - v = apic_read(APIC_LVTT); - apic_write_around(APIC_LVTT, v & ~APIC_LVT_MASKED); - } -} - -void switch_APIC_timer_to_ipi(void *cpumask) -{ - cpumask_t mask = *(cpumask_t *)cpumask; - int cpu = smp_processor_id(); - - if (cpu_isset(cpu, mask) && - !cpu_isset(cpu, timer_bcast_ipi)) { - disable_APIC_timer(); - cpu_set(cpu, timer_bcast_ipi); - } -} -EXPORT_SYMBOL(switch_APIC_timer_to_ipi); - -void switch_ipi_to_APIC_timer(void *cpumask) -{ - cpumask_t mask = *(cpumask_t *)cpumask; - int cpu = smp_processor_id(); - - if (cpu_isset(cpu, mask) && - cpu_isset(cpu, timer_bcast_ipi)) { - cpu_clear(cpu, timer_bcast_ipi); - enable_APIC_timer(); - } -} -EXPORT_SYMBOL(switch_ipi_to_APIC_timer); - -#undef APIC_DIVISOR +#ifdef CONFIG_CRASH_DUMP + boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID)); +#endif + phys_cpu_present_map = physid_mask_of_physid(boot_cpu_physical_apicid); -/* - * Local timer interrupt handler. It does both profiling and - * process statistics/rescheduling. - * - * We do profiling in every local tick, statistics/rescheduling - * happen only every 'profiling multiplier' ticks. The default - * multiplier is 1 and it can be changed by writing the new multiplier - * value into /proc/profile. - */ + setup_local_APIC(); -inline void smp_local_timer_interrupt(void) -{ - profile_tick(CPU_PROFILING); -#ifdef CONFIG_SMP - update_process_times(user_mode_vm(get_irq_regs())); +#ifdef CONFIG_X86_IO_APIC + if (smp_found_config) + if (!skip_ioapic_setup && nr_ioapics) + setup_IO_APIC(); #endif + setup_boot_APIC_clock(); - /* - * We take the 'long' return path, and there every subsystem - * grabs the apropriate locks (kernel lock/ irq lock). - * - * we might want to decouple profiling from the 'long path', - * and do the profiling totally in assembly. - * - * Currently this isn't too much of an issue (performance wise), - * we can take more than 100K local irqs per second on a 100 MHz P5. - */ + return 0; } /* - * Local APIC timer interrupt. This is the most natural way for doing - * local interrupts, but local timer interrupts can be emulated by - * broadcast interrupts too. [in case the hw doesn't support APIC timers] - * - * [ if a single-CPU system runs an SMP kernel then we call the local - * interrupt as well. Thus we cannot inline the local irq ... ] + * APIC command line parameters */ - -fastcall void smp_apic_timer_interrupt(struct pt_regs *regs) +static int __init parse_lapic(char *arg) { - struct pt_regs *old_regs = set_irq_regs(regs); - int cpu = smp_processor_id(); - - /* - * the NMI deadlock-detector uses this. - */ - per_cpu(irq_stat, cpu).apic_timer_irqs++; - - /* - * NOTE! We'd better ACK the irq immediately, - * because timer handling can be slow. - */ - ack_APIC_irq(); - /* - * update_process_times() expects us to have done irq_enter(). - * Besides, if we don't timer interrupts ignore the global - * interrupt lock, which is the WrongThing (tm) to do. - */ - irq_enter(); - smp_local_timer_interrupt(); - irq_exit(); - set_irq_regs(old_regs); + enable_local_apic = 1; + return 0; } +early_param("lapic", parse_lapic); -#ifndef CONFIG_SMP -static void up_apic_timer_interrupt_call(void) +static int __init parse_nolapic(char *arg) { - int cpu = smp_processor_id(); - - /* - * the NMI deadlock-detector uses this. - */ - per_cpu(irq_stat, cpu).apic_timer_irqs++; - - smp_local_timer_interrupt(); + enable_local_apic = -1; + clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); + return 0; } -#endif - -void smp_send_timer_broadcast_ipi(void) -{ - cpumask_t mask; - - cpus_and(mask, cpu_online_map, timer_bcast_ipi); - if (!cpus_empty(mask)) { -#ifdef CONFIG_SMP - send_IPI_mask(mask, LOCAL_TIMER_VECTOR); -#else - /* - * We can directly call the apic timer interrupt handler - * in UP case. Minus all irq related functions - */ - up_apic_timer_interrupt_call(); -#endif - } +early_param("nolapic", parse_nolapic); + +static int __init apic_enable_lapic_timer(char *str) +{ + enable_local_apic_timer = 1; + return 0; } +early_param("lapictimer", apic_enable_lapic_timer); -int setup_profiling_timer(unsigned int multiplier) +static int __init apic_set_verbosity(char *str) { - return -EINVAL; + if (strcmp("debug", str) == 0) + apic_verbosity = APIC_DEBUG; + else if (strcmp("verbose", str) == 0) + apic_verbosity = APIC_VERBOSE; + return 1; } +__setup("apic=", apic_set_verbosity); + +/* + * Local APIC interrupts + */ + /* * This interrupt should _never_ happen with our APIC/SMP architecture */ @@ -1302,15 +1291,14 @@ fastcall void smp_spurious_interrupt(str ack_APIC_irq(); /* see sw-dev-man vol 3, chapter 7.4.13.5 */ - printk(KERN_INFO "spurious APIC interrupt on CPU#%d, should never happen.\n", - smp_processor_id()); + printk(KERN_INFO "spurious APIC interrupt on CPU#%d, " + "should never happen.\n", smp_processor_id()); irq_exit(); } /* * This interrupt should never happen with our APIC/SMP architecture */ - fastcall void smp_error_interrupt(struct pt_regs *regs) { unsigned long v, v1; @@ -1334,69 +1322,262 @@ fastcall void smp_error_interrupt(struct 7: Illegal register address */ printk (KERN_DEBUG "APIC error on CPU%d: %02lx(%02lx)\n", - smp_processor_id(), v , v1); + smp_processor_id(), v , v1); + dump_stack(); irq_exit(); } /* - * This initializes the IO-APIC and APIC hardware if this is - * a UP kernel. + * Initialize APIC interrupts */ -int __init APIC_init_uniprocessor (void) +void __init apic_intr_init(void) { - if (enable_local_apic < 0) - clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); +#ifdef CONFIG_SMP + smp_intr_init(); +#endif + /* self generated IPI for local APIC timer */ + set_intr_gate(LOCAL_TIMER_VECTOR, apic_timer_interrupt); - if (!smp_found_config && !cpu_has_apic) - return -1; + /* IPI vectors for APIC spurious and error interrupts */ + set_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); + set_intr_gate(ERROR_APIC_VECTOR, error_interrupt); - /* - * Complain if the BIOS pretends there is one. - */ - if (!cpu_has_apic && APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) { - printk(KERN_ERR "BIOS bug, local APIC #%d not detected!...\n", - boot_cpu_physical_apicid); - clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); - return -1; + /* thermal monitor LVT interrupt */ +#ifdef CONFIG_X86_MCE_P4THERMAL + set_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt); +#endif +} + +/** + * connect_bsp_APIC - attach the APIC to the interrupt system + */ +void __init connect_bsp_APIC(void) +{ + if (pic_mode) { + /* + * Do not trust the local APIC being empty at bootup. + */ + clear_local_APIC(); + /* + * PIC mode, enable APIC mode in the IMCR, i.e. connect BSP's + * local APIC to INT and NMI lines. + */ + apic_printk(APIC_VERBOSE, "leaving PIC mode, " + "enabling APIC mode.\n"); + outb(0x70, 0x22); + outb(0x01, 0x23); } + enable_apic_mode(); +} - verify_local_APIC(); +/** + * disconnect_bsp_APIC - detach the APIC from the interrupt system + * @virt_wire_setup: indicates, whether virtual wire mode is selected + * + * Virtual wire mode is necessary to deliver legacy interrupts even when the + * APIC is disabled. + */ +void disconnect_bsp_APIC(int virt_wire_setup) +{ + if (pic_mode) { + /* + * Put the board back into PIC mode (has an effect only on + * certain older boards). Note that APIC interrupts, including + * IPIs, won't work beyond this point! The only exception are + * INIT IPIs. + */ + apic_printk(APIC_VERBOSE, "disabling APIC mode, " + "entering PIC mode.\n"); + outb(0x70, 0x22); + outb(0x00, 0x23); + } else { + /* Go back to Virtual Wire compatibility mode */ + unsigned long value; - connect_bsp_APIC(); + /* For the spurious interrupt use vector F, and enable it */ + value = apic_read(APIC_SPIV); + value &= ~APIC_VECTOR_MASK; + value |= APIC_SPIV_APIC_ENABLED; + value |= 0xf; + apic_write_around(APIC_SPIV, value); - /* - * Hack: In case of kdump, after a crash, kernel might be booting - * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid - * might be zero if read from MP tables. Get it from LAPIC. - */ -#ifdef CONFIG_CRASH_DUMP - boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID)); -#endif - phys_cpu_present_map = physid_mask_of_physid(boot_cpu_physical_apicid); + if (!virt_wire_setup) { + /* + * For LVT0 make it edge triggered, active high, + * external and enabled + */ + value = apic_read(APIC_LVT0); + value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | + APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | + APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED ); + value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; + value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); + apic_write_around(APIC_LVT0, value); + } else { + /* Disable LVT0 */ + apic_write_around(APIC_LVT0, APIC_LVT_MASKED); + } - setup_local_APIC(); + /* + * For LVT1 make it edge triggered, active high, nmi and + * enabled + */ + value = apic_read(APIC_LVT1); + value &= ~( + APIC_MODE_MASK | APIC_SEND_PENDING | + APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | + APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); + value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; + value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI); + apic_write_around(APIC_LVT1, value); + } +} -#ifdef CONFIG_X86_IO_APIC - if (smp_found_config) - if (!skip_ioapic_setup && nr_ioapics) - setup_IO_APIC(); +/* + * Power management + */ +#ifdef CONFIG_PM + +static struct { + int active; + /* r/w apic fields */ + unsigned int apic_id; + unsigned int apic_taskpri; + unsigned int apic_ldr; + unsigned int apic_dfr; + unsigned int apic_spiv; + unsigned int apic_lvtt; + unsigned int apic_lvtpc; + unsigned int apic_lvt0; + unsigned int apic_lvt1; + unsigned int apic_lvterr; + unsigned int apic_tmict; + unsigned int apic_tdcr; + unsigned int apic_thmr; +} apic_pm_state; + +static int lapic_suspend(struct sys_device *dev, pm_message_t state) +{ + unsigned long flags; + int maxlvt; + + if (!apic_pm_state.active) + return 0; + + maxlvt = lapic_get_maxlvt(); + + apic_pm_state.apic_id = apic_read(APIC_ID); + apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI); + apic_pm_state.apic_ldr = apic_read(APIC_LDR); + apic_pm_state.apic_dfr = apic_read(APIC_DFR); + apic_pm_state.apic_spiv = apic_read(APIC_SPIV); + apic_pm_state.apic_lvtt = apic_read(APIC_LVTT); + if (maxlvt >= 4) + apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC); + apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); + apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); + apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); + apic_pm_state.apic_tmict = apic_read(APIC_TMICT); + apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); +#ifdef CONFIG_X86_MCE_P4THERMAL + if (maxlvt >= 5) + apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); #endif - setup_boot_APIC_clock(); + local_irq_save(flags); + disable_local_APIC(); + local_irq_restore(flags); return 0; } -static int __init parse_lapic(char *arg) +static int lapic_resume(struct sys_device *dev) { - lapic_enable(); + unsigned int l, h; + unsigned long flags; + int maxlvt; + + if (!apic_pm_state.active) + return 0; + + maxlvt = lapic_get_maxlvt(); + + local_irq_save(flags); + + /* + * Make sure the APICBASE points to the right address + * + * FIXME! This will be wrong if we ever support suspend on + * SMP! We'll need to do this as part of the CPU restore! + */ + rdmsr(MSR_IA32_APICBASE, l, h); + l &= ~MSR_IA32_APICBASE_BASE; + l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr; + wrmsr(MSR_IA32_APICBASE, l, h); + + apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED); + apic_write(APIC_ID, apic_pm_state.apic_id); + apic_write(APIC_DFR, apic_pm_state.apic_dfr); + apic_write(APIC_LDR, apic_pm_state.apic_ldr); + apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri); + apic_write(APIC_SPIV, apic_pm_state.apic_spiv); + apic_write(APIC_LVT0, apic_pm_state.apic_lvt0); + apic_write(APIC_LVT1, apic_pm_state.apic_lvt1); +#ifdef CONFIG_X86_MCE_P4THERMAL + if (maxlvt >= 5) + apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr); +#endif + if (maxlvt >= 4) + apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); + apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); + apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); + apic_write(APIC_TMICT, apic_pm_state.apic_tmict); + apic_write(APIC_ESR, 0); + apic_read(APIC_ESR); + apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); + apic_write(APIC_ESR, 0); + apic_read(APIC_ESR); + local_irq_restore(flags); return 0; } -early_param("lapic", parse_lapic); -static int __init parse_nolapic(char *arg) +/* + * This device has no shutdown method - fully functioning local APICs + * are needed on every CPU up until machine_halt/restart/poweroff. + */ + +static struct sysdev_class lapic_sysclass = { + set_kset_name("lapic"), + .resume = lapic_resume, + .suspend = lapic_suspend, +}; + +static struct sys_device device_lapic = { + .id = 0, + .cls = &lapic_sysclass, +}; + +static void __devinit apic_pm_activate(void) { - lapic_disable(); - return 0; + apic_pm_state.active = 1; } -early_param("nolapic", parse_nolapic); +static int __init init_lapic_sysfs(void) +{ + int error; + + if (!cpu_has_apic) + return 0; + /* XXX: remove suspend/resume procs if !apic_pm_state.active? */ + + error = sysdev_class_register(&lapic_sysclass); + if (!error) + error = sysdev_register(&device_lapic); + return error; +} +device_initcall(init_lapic_sysfs); + +#else /* CONFIG_PM */ + +static void apic_pm_activate(void) { } + +#endif /* CONFIG_PM */ Index: linux/arch/i386/kernel/apm.c =================================================================== --- linux.orig/arch/i386/kernel/apm.c +++ linux/arch/i386/kernel/apm.c @@ -234,7 +234,6 @@ #include "io_ports.h" -extern unsigned long get_cmos_time(void); extern void machine_real_restart(unsigned char *, int); #if defined(CONFIG_APM_DISPLAY_BLANK) && defined(CONFIG_VT) @@ -1170,28 +1169,6 @@ out: spin_unlock(&user_list_lock); } -static void set_time(void) -{ - struct timespec ts; - if (got_clock_diff) { /* Must know time zone in order to set clock */ - ts.tv_sec = get_cmos_time() + clock_cmos_diff; - ts.tv_nsec = 0; - do_settimeofday(&ts); - } -} - -static void get_time_diff(void) -{ -#ifndef CONFIG_APM_RTC_IS_GMT - /* - * Estimate time zone so that set_time can update the clock - */ - clock_cmos_diff = -get_cmos_time(); - clock_cmos_diff += get_seconds(); - got_clock_diff = 1; -#endif -} - static void reinit_timer(void) { #ifdef INIT_TIMER_AFTER_SUSPEND @@ -1231,19 +1208,6 @@ static int suspend(int vetoable) local_irq_disable(); device_power_down(PMSG_SUSPEND); - /* serialize with the timer interrupt */ - write_seqlock(&xtime_lock); - - /* protect against access to timer chip registers */ - spin_lock(&i8253_lock); - - get_time_diff(); - /* - * Irq spinlock must be dropped around set_system_power_state. - * We'll undo any timer changes due to interrupts below. - */ - spin_unlock(&i8253_lock); - write_sequnlock(&xtime_lock); local_irq_enable(); save_processor_state(); @@ -1252,7 +1216,6 @@ static int suspend(int vetoable) restore_processor_state(); local_irq_disable(); - set_time(); reinit_timer(); if (err == APM_NO_ERROR) @@ -1282,11 +1245,6 @@ static void standby(void) local_irq_disable(); device_power_down(PMSG_SUSPEND); - /* serialize with the timer interrupt */ - write_seqlock(&xtime_lock); - /* If needed, notify drivers here */ - get_time_diff(); - write_sequnlock(&xtime_lock); local_irq_enable(); err = set_system_power_state(APM_STATE_STANDBY); @@ -1380,7 +1338,6 @@ static void check_events(void) ignore_bounce = 1; if ((event != APM_NORMAL_RESUME) || (ignore_normal_resume == 0)) { - set_time(); device_resume(); pm_send_all(PM_RESUME, (void *)0); queue_event(event, NULL); @@ -1396,7 +1353,6 @@ static void check_events(void) break; case APM_UPDATE_TIME: - set_time(); break; case APM_CRITICAL_SUSPEND: Index: linux/arch/i386/kernel/cpu/mcheck/therm_throt.c =================================================================== --- linux.orig/arch/i386/kernel/cpu/mcheck/therm_throt.c +++ linux/arch/i386/kernel/cpu/mcheck/therm_throt.c @@ -115,7 +115,6 @@ static __cpuinit int thermal_throttle_ad return sysfs_create_group(&sys_dev->kobj, &thermal_throttle_attr_group); } -#ifdef CONFIG_HOTPLUG_CPU static __cpuinit void thermal_throttle_remove_dev(struct sys_device *sys_dev) { return sysfs_remove_group(&sys_dev->kobj, &thermal_throttle_attr_group); @@ -152,7 +151,6 @@ static struct notifier_block thermal_thr { .notifier_call = thermal_throttle_cpu_callback, }; -#endif /* CONFIG_HOTPLUG_CPU */ static __init int thermal_throttle_init_device(void) { Index: linux/arch/i386/kernel/cpu/mtrr/generic.c =================================================================== --- linux.orig/arch/i386/kernel/cpu/mtrr/generic.c +++ linux/arch/i386/kernel/cpu/mtrr/generic.c @@ -234,7 +234,7 @@ static unsigned long set_mtrr_state(u32 static unsigned long cr4 = 0; static u32 deftype_lo, deftype_hi; -static DEFINE_SPINLOCK(set_atomicity_lock); +static DEFINE_RAW_SPINLOCK(set_atomicity_lock); /* * Since we are disabling the cache don't allow any interrupts - they Index: linux/arch/i386/kernel/cpu/mtrr/main.c =================================================================== --- linux.orig/arch/i386/kernel/cpu/mtrr/main.c +++ linux/arch/i386/kernel/cpu/mtrr/main.c @@ -135,8 +135,6 @@ struct set_mtrr_data { mtrr_type smp_type; }; -#ifdef CONFIG_SMP - static void ipi_handler(void *info) /* [SUMMARY] Synchronisation handler. Executed by "other" CPUs. [RETURNS] Nothing. @@ -166,8 +164,6 @@ static void ipi_handler(void *info) local_irq_restore(flags); } -#endif - /** * set_mtrr - update mtrrs on all processors * @reg: mtrr in question Index: linux/arch/i386/kernel/cpu/transmeta.c =================================================================== --- linux.orig/arch/i386/kernel/cpu/transmeta.c +++ linux/arch/i386/kernel/cpu/transmeta.c @@ -9,7 +9,8 @@ static void __cpuinit init_transmeta(str { unsigned int cap_mask, uk, max, dummy; unsigned int cms_rev1, cms_rev2; - unsigned int cpu_rev, cpu_freq, cpu_flags, new_cpu_rev; + unsigned int cpu_rev, cpu_freq = 0 /* shut up gcc warning */, + cpu_flags, new_cpu_rev; char cpu_info[65]; get_model_name(c); /* Same as AMD/Cyrix */ Index: linux/arch/i386/kernel/cpuid.c =================================================================== --- linux.orig/arch/i386/kernel/cpuid.c +++ linux/arch/i386/kernel/cpuid.c @@ -167,7 +167,6 @@ static int cpuid_class_device_create(int return err; } -#ifdef CONFIG_HOTPLUG_CPU static int cpuid_class_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { unsigned int cpu = (unsigned long)hcpu; @@ -187,7 +186,6 @@ static struct notifier_block __cpuinitda { .notifier_call = cpuid_class_cpu_callback, }; -#endif /* !CONFIG_HOTPLUG_CPU */ static int __init cpuid_init(void) { Index: linux/arch/i386/kernel/crash.c =================================================================== --- linux.orig/arch/i386/kernel/crash.c +++ linux/arch/i386/kernel/crash.c @@ -132,7 +132,7 @@ static int crash_nmi_callback(struct not return 1; } -static void smp_send_nmi_allbutself(void) +void smp_send_nmi_allbutself(void) { cpumask_t mask = cpu_online_map; cpu_clear(safe_smp_processor_id(), mask); Index: linux/arch/i386/kernel/efi.c =================================================================== --- linux.orig/arch/i386/kernel/efi.c +++ linux/arch/i386/kernel/efi.c @@ -271,7 +271,7 @@ void efi_memmap_walk(efi_freemem_callbac struct range { unsigned long start; unsigned long end; - } prev, curr; + } prev = { } /* shut up gcc */ , curr = { } /* shut up gcc */ ; efi_memory_desc_t *md; unsigned long start, end; void *p; Index: linux/arch/i386/kernel/entry.S =================================================================== --- linux.orig/arch/i386/kernel/entry.S +++ linux/arch/i386/kernel/entry.S @@ -259,14 +259,18 @@ ENTRY(resume_userspace) #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) DISABLE_INTERRUPTS + cmpl $0, kernel_preemption + jz restore_nocheck cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? jnz restore_nocheck need_resched: movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl - jz restore_all + jz restore_nocheck testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all + jz restore_nocheck + DISABLE_INTERRUPTS + call preempt_schedule_irq jmp need_resched #endif @@ -323,6 +327,11 @@ sysenter_past_esp: pushl %eax CFI_ADJUST_CFA_OFFSET 4 SAVE_ALL +#ifdef CONFIG_EVENT_TRACE + pushl %edx; pushl %ecx; pushl %ebx; pushl %eax + call sys_call + popl %eax; popl %ebx; popl %ecx; popl %edx +#endif GET_THREAD_INFO(%ebp) /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ @@ -337,6 +346,11 @@ sysenter_past_esp: movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx jne syscall_exit_work +#ifdef CONFIG_EVENT_TRACE + pushl %eax + call sys_ret + popl %eax +#endif /* if something modifies registers it must also disable sysexit */ movl EIP(%esp), %edx movl OLDESP(%esp), %ecx @@ -352,6 +366,11 @@ ENTRY(system_call) pushl %eax # save orig_eax CFI_ADJUST_CFA_OFFSET 4 SAVE_ALL +#ifdef CONFIG_EVENT_TRACE + pushl %edx; pushl %ecx; pushl %ebx; pushl %eax + call sys_call + popl %eax; popl %ebx; popl %ecx; popl %edx +#endif GET_THREAD_INFO(%ebp) testl $TF_MASK,EFLAGS(%esp) jz no_singlestep @@ -441,19 +460,19 @@ ldt_ss: ALIGN RING0_PTREGS_FRAME # can't unwind into user space anyway work_pending: - testb $_TIF_NEED_RESCHED, %cl + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED), %ecx jz work_notifysig work_resched: - call schedule - DISABLE_INTERRUPTS # make sure we don't miss an interrupt + DISABLE_INTERRUPTS + call __schedule + # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - TRACE_IRQS_OFF movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done other # than syscall tracing? jz restore_all - testb $_TIF_NEED_RESCHED, %cl + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED), %ecx jnz work_resched work_notifysig: # deal with pending signals and Index: linux/arch/i386/kernel/head.S =================================================================== --- linux.orig/arch/i386/kernel/head.S +++ linux/arch/i386/kernel/head.S @@ -457,6 +457,7 @@ ignore_int: call printk #endif addl $(5*4),%esp + call dump_stack popl %ds popl %es popl %edx Index: linux/arch/i386/kernel/i386_ksyms.c =================================================================== --- linux.orig/arch/i386/kernel/i386_ksyms.c +++ linux/arch/i386/kernel/i386_ksyms.c @@ -2,10 +2,12 @@ #include #include -EXPORT_SYMBOL(__down_failed); -EXPORT_SYMBOL(__down_failed_interruptible); -EXPORT_SYMBOL(__down_failed_trylock); -EXPORT_SYMBOL(__up_wakeup); +#ifdef CONFIG_ASM_SEMAPHORES +EXPORT_SYMBOL(__compat_down_failed); +EXPORT_SYMBOL(__compat_down_failed_interruptible); +EXPORT_SYMBOL(__compat_down_failed_trylock); +EXPORT_SYMBOL(__compat_up_wakeup); +#endif /* Networking helper routines. */ EXPORT_SYMBOL(csum_partial_copy_generic); @@ -20,7 +22,7 @@ EXPORT_SYMBOL(__put_user_8); EXPORT_SYMBOL(strstr); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) && defined(CONFIG_ASM_SEMAPHORES) extern void FASTCALL( __write_lock_failed(rwlock_t *rw)); extern void FASTCALL( __read_lock_failed(rwlock_t *rw)); EXPORT_SYMBOL(__write_lock_failed); Index: linux/arch/i386/kernel/i8253.c =================================================================== --- linux.orig/arch/i386/kernel/i8253.c +++ linux/arch/i386/kernel/i8253.c @@ -2,7 +2,7 @@ * i8253.c 8253/PIT functions * */ -#include +#include #include #include #include @@ -16,23 +16,99 @@ #include "io_ports.h" -DEFINE_SPINLOCK(i8253_lock); +DEFINE_RAW_SPINLOCK(i8253_lock); EXPORT_SYMBOL(i8253_lock); -void setup_pit_timer(void) +#ifdef CONFIG_HPET_TIMER +/* + * HPET replaces the PIT, when enabled. So we need to know, which of + * the two timers is used + */ +struct clock_event_device *global_clock_event; +#endif + +/* + * Initialize the PIT timer. + * + * This is also called after resume to bring the PIT into operation again. + */ +static void init_pit_timer(enum clock_event_mode mode, + struct clock_event_device *evt) +{ + unsigned long flags; + + spin_lock_irqsave(&i8253_lock, flags); + + switch(mode) { + case CLOCK_EVT_PERIODIC: + /* binary, mode 2, LSB/MSB, ch 0 */ + outb_p(0x34, PIT_MODE); + udelay(10); + outb_p(LATCH & 0xff , PIT_CH0); /* LSB */ + udelay(10); + outb(LATCH >> 8 , PIT_CH0); /* MSB */ + break; + + case CLOCK_EVT_ONESHOT: + case CLOCK_EVT_SHUTDOWN: + /* One shot setup */ + outb_p(0x38, PIT_MODE); + udelay(10); + break; + } + spin_unlock_irqrestore(&i8253_lock, flags); +} + +/* + * Program the next event in oneshot mode + * + * Delta is given in PIT ticks + */ +static void pit_next_event(unsigned long delta, struct clock_event_device *evt) { unsigned long flags; spin_lock_irqsave(&i8253_lock, flags); - outb_p(0x34,PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */ - udelay(10); - outb_p(LATCH & 0xff , PIT_CH0); /* LSB */ - udelay(10); - outb(LATCH >> 8 , PIT_CH0); /* MSB */ + outb_p(delta & 0xff , PIT_CH0); /* LSB */ + outb(delta >> 8 , PIT_CH0); /* MSB */ spin_unlock_irqrestore(&i8253_lock, flags); } /* + * On UP the PIT can serve all of the possible timer functions. On SMP systems + * it can be solely used for the global tick. + * + * The profiling and update capabilites are switched off once the local apic is + * registered. This mechanism replaces the previous #ifdef LOCAL_APIC - + * !using_apic_timer decisions in do_timer_interrupt_hook() + */ +struct clock_event_device pit_clockevent = { + .name = "pit", + .capabilities = CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | CLOCK_CAP_UPDATE + | CLOCK_CAP_NEXTEVT, + .set_mode = init_pit_timer, + .set_next_event = pit_next_event, + .shift = 32, +}; + +/* + * Initialize the conversion factor and the min/max deltas of the clock event + * structure and register the clock event source with the framework. + */ +void __init setup_pit_timer(void) +{ + pit_clockevent.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, 32); + pit_clockevent.max_delta_ns = + clockevent_delta2ns(0x7FFF, &pit_clockevent); + pit_clockevent.min_delta_ns = + clockevent_delta2ns(0xF, &pit_clockevent); + register_global_clockevent(&pit_clockevent); +#ifdef CONFIG_HPET_TIMER + global_clock_event = &pit_clockevent; +#endif +} + +/* * Since the PIT overflows every tick, its not very useful * to just read by itself. So use jiffies to emulate a free * running counter: @@ -46,7 +122,7 @@ static cycle_t pit_read(void) static u32 old_jifs; spin_lock_irqsave(&i8253_lock, flags); - /* + /* * Although our caller may have the read side of xtime_lock, * this is now a seqlock, and we are cheating in this routine * by having side effects on state that we cannot undo if Index: linux/arch/i386/kernel/i8259.c =================================================================== --- linux.orig/arch/i386/kernel/i8259.c +++ linux/arch/i386/kernel/i8259.c @@ -35,12 +35,13 @@ */ static int i8259A_auto_eoi; -DEFINE_SPINLOCK(i8259A_lock); +DEFINE_RAW_SPINLOCK(i8259A_lock); static void mask_and_ack_8259A(unsigned int); static struct irq_chip i8259A_chip = { .name = "XT-PIC", .mask = disable_8259A_irq, + .disable = disable_8259A_irq, .unmask = enable_8259A_irq, .mask_ack = mask_and_ack_8259A, }; @@ -170,6 +171,8 @@ static void mask_and_ack_8259A(unsigned */ if (cached_irq_mask & irqmask) goto spurious_8259A_irq; + if (irq & 8) + outb(0x60+(irq&7),PIC_SLAVE_CMD); /* 'Specific EOI' to slave */ cached_irq_mask |= irqmask; handle_real_irq: @@ -297,10 +300,10 @@ void init_8259A(int auto_eoi) outb_p(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init */ outb_p(0x20 + 0, PIC_MASTER_IMR); /* ICW2: 8259A-1 IR0-7 mapped to 0x20-0x27 */ outb_p(1U << PIC_CASCADE_IR, PIC_MASTER_IMR); /* 8259A-1 (the master) has a slave on IR2 */ - if (auto_eoi) /* master does Auto EOI */ - outb_p(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); - else /* master expects normal EOI */ + if (!auto_eoi) /* master expects normal EOI */ outb_p(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); + else /* master does Auto EOI */ + outb_p(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); outb_p(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */ outb_p(0x20 + 8, PIC_SLAVE_IMR); /* ICW2: 8259A-2 IR0-7 mapped to 0x28-0x2f */ @@ -350,7 +353,7 @@ static irqreturn_t math_error_irq(int cp * New motherboards sometimes make IRQ 13 be a PCI interrupt, * so allow interrupt sharing. */ -static struct irqaction fpu_irq = { math_error_irq, 0, CPU_MASK_NONE, "fpu", NULL, NULL }; +static struct irqaction fpu_irq = { math_error_irq, IRQF_NODELAY, CPU_MASK_NONE, "fpu", NULL, NULL }; void __init init_ISA_irqs (void) { Index: linux/arch/i386/kernel/io_apic.c =================================================================== --- linux.orig/arch/i386/kernel/io_apic.c +++ linux/arch/i386/kernel/io_apic.c @@ -55,8 +55,8 @@ atomic_t irq_mis_count; /* Where if anywhere is the i8259 connect in external int mode */ static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; -static DEFINE_SPINLOCK(ioapic_lock); -static DEFINE_SPINLOCK(vector_lock); +static DEFINE_RAW_SPINLOCK(ioapic_lock); +static DEFINE_RAW_SPINLOCK(vector_lock); int timer_over_8254 __initdata = 1; @@ -254,18 +254,6 @@ static void __unmask_IO_APIC_irq (unsign __modify_IO_APIC_irq(irq, 0, 0x00010000); } -/* mask = 1, trigger = 0 */ -static void __mask_and_edge_IO_APIC_irq (unsigned int irq) -{ - __modify_IO_APIC_irq(irq, 0x00010000, 0x00008000); -} - -/* mask = 0, trigger = 1 */ -static void __unmask_and_level_IO_APIC_irq (unsigned int irq) -{ - __modify_IO_APIC_irq(irq, 0x00008000, 0x00010000); -} - static void mask_IO_APIC_irq (unsigned int irq) { unsigned long flags; @@ -1288,7 +1276,6 @@ static void ioapic_register_intr(int irq set_irq_chip_and_handler_name(irq, &ioapic_chip, handle_fasteoi_irq, "fasteoi"); else { - irq_desc[irq].status |= IRQ_DELAYED_DISABLE; set_irq_chip_and_handler_name(irq, &ioapic_chip, handle_edge_irq, "edge"); } @@ -1557,7 +1544,7 @@ void __init print_IO_APIC(void) return; } -#if 0 +#if 1 static void print_APIC_bitfield (int base) { @@ -1594,7 +1581,7 @@ void /*__init*/ print_local_APIC(void * v = apic_read(APIC_LVR); printk(KERN_INFO "... APIC VERSION: %08x\n", v); ver = GET_APIC_VERSION(v); - maxlvt = get_maxlvt(); + maxlvt = lapic_get_maxlvt(); v = apic_read(APIC_TASKPRI); printk(KERN_DEBUG "... APIC TASKPRI: %08x (%02x)\n", v, v & APIC_TPRI_MASK); @@ -1949,7 +1936,7 @@ static int __init timer_irq_works(void) * might have cached one ExtINT interrupt. Finally, at * least one tick may be lost due to delays. */ - if (jiffies - t1 > 4) + if (jiffies - t1 > 4 && jiffies - t1 < 16) return 1; return 0; @@ -2038,8 +2025,10 @@ static void ack_ioapic_quirk_irq(unsigne if (!(v & (1 << (i & 0x1f)))) { atomic_inc(&irq_mis_count); spin_lock(&ioapic_lock); - __mask_and_edge_IO_APIC_irq(irq); - __unmask_and_level_IO_APIC_irq(irq); + /* mask = 1, trigger = 0 */ + __modify_IO_APIC_irq(irq, 0x00010000, 0x00008000); + /* mask = 0, trigger = 1 */ + __modify_IO_APIC_irq(irq, 0x00008000, 0x00010000); spin_unlock(&ioapic_lock); } } @@ -2479,7 +2468,7 @@ device_initcall(ioapic_init_sysfs); int create_irq(void) { /* Allocate an unused irq */ - int irq, new, vector; + int irq, new, vector = 0 /* shut up GCC */; unsigned long flags; irq = -ENOSPC; Index: linux/arch/i386/kernel/irq.c =================================================================== --- linux.orig/arch/i386/kernel/irq.c +++ linux/arch/i386/kernel/irq.c @@ -10,7 +10,6 @@ * io_apic.c.) */ -#include #include #include #include @@ -19,19 +18,34 @@ #include #include +#include +#include + DEFINE_PER_CPU(irq_cpustat_t, irq_stat) ____cacheline_internodealigned_in_smp; EXPORT_PER_CPU_SYMBOL(irq_stat); -#ifndef CONFIG_X86_LOCAL_APIC /* * 'what should we do if we get a hw irq event on an illegal vector'. * each architecture has to answer this themselves. */ void ack_bad_irq(unsigned int irq) { - printk("unexpected IRQ trap at vector %02x\n", irq); -} + printk(KERN_ERR "unexpected IRQ trap at vector %02x\n", irq); + +#ifdef CONFIG_X86_LOCAL_APIC + /* + * Currently unexpected vectors happen only on SMP and APIC. + * We _must_ ack these because every local APIC has only N + * irq slots per priority level, and a 'hanging, unacked' IRQ + * holds up an irq slot - in excessive cases (when multiple + * unexpected vectors occur) that might lock up the APIC + * completely. + * But only ack when the APIC is enabled -AK + */ + if (cpu_has_apic) + ack_APIC_irq(); #endif +} #ifdef CONFIG_4KSTACKS /* @@ -51,7 +65,7 @@ static union irq_ctx *softirq_ctx[NR_CPU * SMP cross-CPU interrupts have their own specific * handlers). */ -fastcall unsigned int do_IRQ(struct pt_regs *regs) +fastcall notrace unsigned int do_IRQ(struct pt_regs *regs) { struct pt_regs *old_regs; /* high bit used in ret_from_ code */ @@ -62,6 +76,8 @@ fastcall unsigned int do_IRQ(struct pt_r u32 *isp; #endif + irq_show_regs_callback(smp_processor_id(), regs); + if (unlikely((unsigned)irq >= NR_IRQS)) { printk(KERN_EMERG "%s: cannot handle IRQ %d\n", __FUNCTION__, irq); @@ -70,6 +86,11 @@ fastcall unsigned int do_IRQ(struct pt_r old_regs = set_irq_regs(regs); irq_enter(); +#ifdef CONFIG_EVENT_TRACE + if (irq == trace_user_trigger_irq) + user_trace_start(); +#endif + trace_special(regs->eip, irq, 0); #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? */ { @@ -78,7 +99,7 @@ fastcall unsigned int do_IRQ(struct pt_r __asm__ __volatile__("andl %%esp,%0" : "=r" (esp) : "0" (THREAD_SIZE - 1)); if (unlikely(esp < (sizeof(struct thread_info) + STACK_WARN))) { - printk("do_IRQ: stack overflow: %ld\n", + printk("BUG: do_IRQ: stack overflow: %ld\n", esp - sizeof(struct thread_info)); dump_stack(); } Index: linux/arch/i386/kernel/kprobes.c =================================================================== --- linux.orig/arch/i386/kernel/kprobes.c +++ linux/arch/i386/kernel/kprobes.c @@ -338,7 +338,7 @@ ss_probe: /* Boost up -- we can execute copied instructions directly */ reset_current_kprobe(); regs->eip = (unsigned long)p->ainsn.insn; - preempt_enable_no_resched(); + preempt_enable(); return 1; } #endif @@ -347,7 +347,7 @@ ss_probe: return 1; no_kprobe: - preempt_enable_no_resched(); + preempt_enable(); return ret; } @@ -579,7 +579,7 @@ static int __kprobes post_kprobe_handler } reset_current_kprobe(); out: - preempt_enable_no_resched(); + preempt_enable(); /* * if somebody else is singlestepping across a probe point, eflags @@ -613,7 +613,7 @@ static int __kprobes kprobe_fault_handle restore_previous_kprobe(kcb); else reset_current_kprobe(); - preempt_enable_no_resched(); + preempt_enable(); break; case KPROBE_HIT_ACTIVE: case KPROBE_HIT_SSDONE: @@ -747,7 +747,7 @@ int __kprobes longjmp_break_handler(stru *regs = kcb->jprobe_saved_regs; memcpy((kprobe_opcode_t *) stack_addr, kcb->jprobes_stack, MIN_STACK_SIZE(stack_addr)); - preempt_enable_no_resched(); + preempt_enable(); return 1; } return 0; Index: linux/arch/i386/kernel/mcount-wrapper.S =================================================================== --- /dev/null +++ linux/arch/i386/kernel/mcount-wrapper.S @@ -0,0 +1,27 @@ +/* + * linux/arch/i386/mcount-wrapper.S + * + * Copyright (C) 2004 Ingo Molnar + */ + +.globl mcount +mcount: + + cmpl $0, mcount_enabled + jz out + + push %ebp + mov %esp, %ebp + pushl %eax + pushl %ecx + pushl %edx + + call __mcount + + popl %edx + popl %ecx + popl %eax + popl %ebp +out: + ret + Index: linux/arch/i386/kernel/microcode.c =================================================================== --- linux.orig/arch/i386/kernel/microcode.c +++ linux/arch/i386/kernel/microcode.c @@ -116,7 +116,7 @@ MODULE_LICENSE("GPL"); #define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE) /* serialize access to the physical write to MSR 0x79 */ -static DEFINE_SPINLOCK(microcode_update_lock); +static DEFINE_RAW_SPINLOCK(microcode_update_lock); /* no concurrent ->write()s are allowed on /dev/cpu/microcode */ static DEFINE_MUTEX(microcode_mutex); @@ -384,7 +384,7 @@ static int do_microcode_update (void) { long cursor = 0; int error = 0; - void *new_mc; + void *new_mc = NULL /* shut up GCC */; int cpu; cpumask_t old; @@ -703,7 +703,6 @@ static struct sysdev_driver mc_sysdev_dr .resume = mc_sysdev_resume, }; -#ifdef CONFIG_HOTPLUG_CPU static __cpuinit int mc_cpu_callback(struct notifier_block *nb, unsigned long action, void *hcpu) { @@ -726,7 +725,6 @@ mc_cpu_callback(struct notifier_block *n static struct notifier_block mc_cpu_notifier = { .notifier_call = mc_cpu_callback, }; -#endif static int __init microcode_init (void) { Index: linux/arch/i386/kernel/mpparse.c =================================================================== --- linux.orig/arch/i386/kernel/mpparse.c +++ linux/arch/i386/kernel/mpparse.c @@ -479,7 +479,7 @@ static int __init smp_read_mpc(struct mp } ++mpc_record; } - clustered_apic_check(); + setup_apic_routing(); if (!num_processors) printk(KERN_ERR "SMP mptable: no processors registered!\n"); return num_processors; Index: linux/arch/i386/kernel/msr.c =================================================================== --- linux.orig/arch/i386/kernel/msr.c +++ linux/arch/i386/kernel/msr.c @@ -250,7 +250,6 @@ static int msr_class_device_create(int i return err; } -#ifdef CONFIG_HOTPLUG_CPU static int msr_class_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { @@ -271,7 +270,6 @@ static struct notifier_block __cpuinitda { .notifier_call = msr_class_cpu_callback, }; -#endif static int __init msr_init(void) { Index: linux/arch/i386/kernel/nmi.c =================================================================== --- linux.orig/arch/i386/kernel/nmi.c +++ linux/arch/i386/kernel/nmi.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -56,7 +57,7 @@ static DEFINE_PER_CPU(unsigned long, evn atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ unsigned int nmi_watchdog = NMI_DEFAULT; -static unsigned int nmi_hz = HZ; +static unsigned int nmi_hz = 1000; struct nmi_watchdog_ctlblk { int enabled; @@ -192,7 +193,6 @@ static __cpuinit inline int nmi_known_cp return 0; } -#ifdef CONFIG_SMP /* The performance counters used by NMI_LOCAL_APIC don't trigger when * the CPU is idle. To make sure the NMI watchdog really ticks on all * CPUs during the test make them busy. @@ -200,7 +200,12 @@ static __cpuinit inline int nmi_known_cp static __init void nmi_cpu_busy(void *data) { volatile int *endflag = data; + /* + * avoid a warning, on PREEMPT_RT this wont run in hardirq context: + */ +#ifndef CONFIG_PREEMPT_RT local_irq_enable_in_hardirq(); +#endif /* Intentionally don't use cpu_relax here. This is to make sure that the performance counter really ticks, even if there is a simulator or similar that catches the @@ -210,7 +215,6 @@ static __init void nmi_cpu_busy(void *da while (*endflag == 0) barrier(); } -#endif static int __init check_nmi_watchdog(void) { @@ -244,7 +248,7 @@ static int __init check_nmi_watchdog(voi for_each_possible_cpu(cpu) prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count; local_irq_enable(); - mdelay((10*1000)/nmi_hz); // wait 10 ticks + mdelay((100*1000)/nmi_hz); // wait 100 ticks for_each_possible_cpu(cpu) { #ifdef CONFIG_SMP @@ -277,7 +281,6 @@ static int __init check_nmi_watchdog(voi if (nmi_watchdog == NMI_LOCAL_APIC) { struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk); - nmi_hz = 1; /* * On Intel CPUs with ARCH_PERFMON only 32 bits in the counter * are writable, with higher bits sign extending from bit 31. @@ -292,6 +295,7 @@ static int __init check_nmi_watchdog(voi nmi_hz = count + 1; } } + nmi_hz = 10000; kfree(prev_nmi_count); return 0; @@ -372,6 +376,34 @@ void enable_timer_nmi_watchdog(void) } } +static void __acpi_nmi_disable(void *__unused) +{ + apic_write_around(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED); +} + +/* + * Disable timer based NMIs on all CPUs: + */ +void acpi_nmi_disable(void) +{ + if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) + on_each_cpu(__acpi_nmi_disable, NULL, 0, 1); +} + +static void __acpi_nmi_enable(void *__unused) +{ + apic_write_around(APIC_LVT0, APIC_DM_NMI); +} + +/* + * Enable timer based NMIs on all CPUs: + */ +void acpi_nmi_enable(void) +{ + if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) + on_each_cpu(__acpi_nmi_enable, NULL, 0, 1); +} + #ifdef CONFIG_PM static int nmi_pm_active; /* nmi_active before suspend */ @@ -885,9 +917,48 @@ EXPORT_SYMBOL(touch_nmi_watchdog); extern void die_nmi(struct pt_regs *, const char *msg); -__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) +int nmi_show_regs[NR_CPUS]; + +void nmi_show_all_regs(void) { + int i; + if (system_state == SYSTEM_BOOTING) { + printk("nmi_show_all_regs(): system state %d, not doing.\n", + system_state); + return; + } + printk("nmi_show_all_regs(): start on CPU#%d.\n", + raw_smp_processor_id()); + dump_stack(); + + for_each_online_cpu(i) + nmi_show_regs[i] = 1; + + smp_send_nmi_allbutself(); + + for_each_online_cpu(i) { + while (nmi_show_regs[i] == 1) + barrier(); + } +} + +static DEFINE_RAW_SPINLOCK(nmi_print_lock); + +notrace void irq_show_regs_callback(int cpu, struct pt_regs *regs) +{ + if (!nmi_show_regs[cpu]) + return; + + nmi_show_regs[cpu] = 0; + spin_lock(&nmi_print_lock); + printk("NMI show regs on CPU#%d:\n", cpu); + show_regs(regs); + spin_unlock(&nmi_print_lock); +} + +notrace __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) +{ /* * Since current_thread_info()-> is always on the stack, and we * always switch the stack NMI-atomically, it's safe to use @@ -900,6 +971,8 @@ __kprobes int nmi_watchdog_tick(struct p u64 dummy; int rc=0; + __profile_tick(CPU_PROFILING, regs); + /* check for other users first */ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP) { @@ -907,20 +980,46 @@ __kprobes int nmi_watchdog_tick(struct p touched = 1; } - sum = per_cpu(irq_stat, cpu).apic_timer_irqs; + /* + * Take the local apic timer and PIT/HPET into account. We don't + * know which one is active, when we have highres/dyntick on + */ + sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_irqs(0); + + irq_show_regs_callback(cpu, regs); /* if the apic timer isn't firing, this cpu isn't doing much */ + /* if the none of the timers isn't firing, this cpu isn't doing much */ if (!touched && last_irq_sums[cpu] == sum) { /* * Ayiee, looks like this CPU is stuck ... * wait a few IRQs (5 seconds) before doing the oops ... */ alert_counter[cpu]++; - if (alert_counter[cpu] == 5*nmi_hz) - /* - * die_nmi will return ONLY if NOTIFY_STOP happens.. - */ - die_nmi(regs, "BUG: NMI Watchdog detected LOCKUP"); + if (alert_counter[cpu] && !(alert_counter[cpu] % (5*nmi_hz))) { + int i; + +// bust_spinlocks(1); + spin_lock(&nmi_print_lock); + printk("NMI watchdog detected lockup on CPU#%d (%d/%d)\n", + cpu, alert_counter[cpu], 5*nmi_hz); + show_regs(regs); + spin_unlock(&nmi_print_lock); + + for_each_online_cpu(i) { + if (i == cpu) + continue; + nmi_show_regs[i] = 1; + while (nmi_show_regs[i] == 1) + cpu_relax(); + } + printk("letting NMI watchdog run again ...\n"); + for_each_online_cpu(i) + alert_counter[i] = 0; + +// die_nmi(regs, "NMI Watchdog detected LOCKUP"); + } + } else { last_irq_sums[cpu] = sum; alert_counter[cpu] = 0; Index: linux/arch/i386/kernel/process.c =================================================================== --- linux.orig/arch/i386/kernel/process.c +++ linux/arch/i386/kernel/process.c @@ -104,16 +104,16 @@ void default_idle(void) if (!hlt_counter && boot_cpu_data.hlt_works_ok) { current_thread_info()->status &= ~TS_POLLING; smp_mb__after_clear_bit(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { local_irq_disable(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) safe_halt(); else local_irq_enable(); } current_thread_info()->status |= TS_POLLING; } else { - while (!need_resched()) + while (!need_resched() && !need_resched_delayed()) cpu_relax(); } } @@ -135,7 +135,8 @@ static void poll_idle (void) "testl %0, %1;" "rep; nop;" "je 2b;" - : : "i"(_TIF_NEED_RESCHED), "m" (current_thread_info()->flags)); + : : "i"(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED), + "m" (current_thread_info()->flags)); } #ifdef CONFIG_HOTPLUG_CPU @@ -178,7 +179,10 @@ void cpu_idle(void) /* endless idle loop with no priority at all */ while (1) { - while (!need_resched()) { + BUG_ON(irqs_disabled()); + + hrtimer_stop_sched_tick(); + while (!need_resched() && !need_resched_delayed()) { void (*idle)(void); if (__get_cpu_var(cpu_idle_state)) @@ -196,9 +200,12 @@ void cpu_idle(void) __get_cpu_var(irq_stat).idle_timestamp = jiffies; idle(); } - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + hrtimer_restart_sched_tick(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } @@ -244,10 +251,10 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait); */ void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) __mwait(eax, ecx); } } @@ -256,7 +263,7 @@ void mwait_idle_with_hints(unsigned long static void mwait_idle(void) { local_irq_enable(); - while (!need_resched()) + while (!need_resched() && !need_resched_delayed()) mwait_idle_with_hints(0, 0); } @@ -365,15 +372,23 @@ void exit_thread(void) if (unlikely(test_thread_flag(TIF_IO_BITMAP))) { struct task_struct *tsk = current; struct thread_struct *t = &tsk->thread; - int cpu = get_cpu(); - struct tss_struct *tss = &per_cpu(init_tss, cpu); + void *io_bitmap_ptr = t->io_bitmap_ptr; + int cpu; + struct tss_struct *tss; - kfree(t->io_bitmap_ptr); + /* + * On PREEMPT_RT we must not call kfree() with + * preemption disabled, so we first zap the pointer: + */ t->io_bitmap_ptr = NULL; + kfree(io_bitmap_ptr); + clear_thread_flag(TIF_IO_BITMAP); /* * Careful, clear this in the TSS too: */ + cpu = get_cpu(); + tss = &per_cpu(init_tss, cpu); memset(tss->io_bitmap, 0xff, tss->io_bitmap_max); t->io_bitmap_max = 0; tss->io_bitmap_owner = NULL; Index: linux/arch/i386/kernel/setup.c =================================================================== --- linux.orig/arch/i386/kernel/setup.c +++ linux/arch/i386/kernel/setup.c @@ -62,7 +62,7 @@ #include #include #include - +#include /* Forward Declaration. */ void __init find_max_pfn(void); @@ -1479,6 +1479,7 @@ void __init setup_arch(char **cmdline_p) #endif #endif tsc_init(); + vsyscall_init(); } static __init int add_pcspkr(void) Index: linux/arch/i386/kernel/signal.c =================================================================== --- linux.orig/arch/i386/kernel/signal.c +++ linux/arch/i386/kernel/signal.c @@ -532,6 +532,13 @@ handle_signal(unsigned long sig, siginfo } } +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * If TF is set due to a debugger (PT_DTRACE), clear the TF flag so * that register information in the sigcontext is correct. @@ -572,6 +579,13 @@ static void fastcall do_signal(struct pt struct k_sigaction ka; sigset_t *oldset; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux/arch/i386/kernel/smp.c =================================================================== --- linux.orig/arch/i386/kernel/smp.c +++ linux/arch/i386/kernel/smp.c @@ -255,7 +255,7 @@ void send_IPI_mask_sequence(cpumask_t ma static cpumask_t flush_cpumask; static struct mm_struct * flush_mm; static unsigned long flush_va; -static DEFINE_SPINLOCK(tlbstate_lock); +static DEFINE_RAW_SPINLOCK(tlbstate_lock); #define FLUSH_ALL 0xffffffff /* @@ -402,7 +402,7 @@ static void flush_tlb_others(cpumask_t c while (!cpus_empty(flush_cpumask)) /* nothing. lockup detection does not belong here */ - mb(); + cpu_relax(); flush_mm = NULL; flush_va = 0; @@ -493,10 +493,20 @@ void smp_send_reschedule(int cpu) } /* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + send_IPI_allbutself(RESCHEDULE_VECTOR); +} + +/* * Structure and data for smp_call_function(). This is designed to minimise * static memory requirements. It also looks cleaner. */ -static DEFINE_SPINLOCK(call_lock); +static DEFINE_RAW_SPINLOCK(call_lock); struct call_data_struct { void (*func) (void *info); @@ -601,15 +611,16 @@ void smp_send_stop(void) } /* - * Reschedule call back. Nothing to do, - * all the work is done automatically when - * we return from the interrupt. + * Reschedule call back. Trigger a reschedule pass so that + * RT-overload balancing can pass tasks around. */ -fastcall void smp_reschedule_interrupt(struct pt_regs *regs) +fastcall notrace void smp_reschedule_interrupt(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); + trace_special(regs->eip, 0, 0); ack_APIC_irq(); set_irq_regs(old_regs); + set_tsk_need_resched(current); } fastcall void smp_call_function_interrupt(struct pt_regs *regs) Index: linux/arch/i386/kernel/smpboot.c =================================================================== --- linux.orig/arch/i386/kernel/smpboot.c +++ linux/arch/i386/kernel/smpboot.c @@ -88,12 +88,6 @@ cpumask_t cpu_possible_map; EXPORT_SYMBOL(cpu_possible_map); static cpumask_t smp_commenced_mask; -/* TSC's upper 32 bits can't be written in eariler CPU (before prescott), there - * is no way to resync one AP against BP. TBD: for prescott and above, we - * should use IA64's algorithm - */ -static int __devinitdata tsc_sync_disabled; - /* Per CPU bogomips and other parameters */ struct cpuinfo_x86 cpu_data[NR_CPUS] __cacheline_aligned; EXPORT_SYMBOL(cpu_data); @@ -210,151 +204,6 @@ valid_k7: ; } -/* - * TSC synchronization. - * - * We first check whether all CPUs have their TSC's synchronized, - * then we print a warning if not, and always resync. - */ - -static struct { - atomic_t start_flag; - atomic_t count_start; - atomic_t count_stop; - unsigned long long values[NR_CPUS]; -} tsc __initdata = { - .start_flag = ATOMIC_INIT(0), - .count_start = ATOMIC_INIT(0), - .count_stop = ATOMIC_INIT(0), -}; - -#define NR_LOOPS 5 - -static void __init synchronize_tsc_bp(void) -{ - int i; - unsigned long long t0; - unsigned long long sum, avg; - long long delta; - unsigned int one_usec; - int buggy = 0; - - printk(KERN_INFO "checking TSC synchronization across %u CPUs: ", num_booting_cpus()); - - /* convert from kcyc/sec to cyc/usec */ - one_usec = cpu_khz / 1000; - - atomic_set(&tsc.start_flag, 1); - wmb(); - - /* - * We loop a few times to get a primed instruction cache, - * then the last pass is more or less synchronized and - * the BP and APs set their cycle counters to zero all at - * once. This reduces the chance of having random offsets - * between the processors, and guarantees that the maximum - * delay between the cycle counters is never bigger than - * the latency of information-passing (cachelines) between - * two CPUs. - */ - for (i = 0; i < NR_LOOPS; i++) { - /* - * all APs synchronize but they loop on '== num_cpus' - */ - while (atomic_read(&tsc.count_start) != num_booting_cpus()-1) - cpu_relax(); - atomic_set(&tsc.count_stop, 0); - wmb(); - /* - * this lets the APs save their current TSC: - */ - atomic_inc(&tsc.count_start); - - rdtscll(tsc.values[smp_processor_id()]); - /* - * We clear the TSC in the last loop: - */ - if (i == NR_LOOPS-1) - write_tsc(0, 0); - - /* - * Wait for all APs to leave the synchronization point: - */ - while (atomic_read(&tsc.count_stop) != num_booting_cpus()-1) - cpu_relax(); - atomic_set(&tsc.count_start, 0); - wmb(); - atomic_inc(&tsc.count_stop); - } - - sum = 0; - for (i = 0; i < NR_CPUS; i++) { - if (cpu_isset(i, cpu_callout_map)) { - t0 = tsc.values[i]; - sum += t0; - } - } - avg = sum; - do_div(avg, num_booting_cpus()); - - for (i = 0; i < NR_CPUS; i++) { - if (!cpu_isset(i, cpu_callout_map)) - continue; - delta = tsc.values[i] - avg; - if (delta < 0) - delta = -delta; - /* - * We report bigger than 2 microseconds clock differences. - */ - if (delta > 2*one_usec) { - long long realdelta; - - if (!buggy) { - buggy = 1; - printk("\n"); - } - realdelta = delta; - do_div(realdelta, one_usec); - if (tsc.values[i] < avg) - realdelta = -realdelta; - - if (realdelta) - printk(KERN_INFO "CPU#%d had %Ld usecs TSC " - "skew, fixed it up.\n", i, realdelta); - } - } - if (!buggy) - printk("passed.\n"); -} - -static void __init synchronize_tsc_ap(void) -{ - int i; - - /* - * Not every cpu is online at the time - * this gets called, so we first wait for the BP to - * finish SMP initialization: - */ - while (!atomic_read(&tsc.start_flag)) - cpu_relax(); - - for (i = 0; i < NR_LOOPS; i++) { - atomic_inc(&tsc.count_start); - while (atomic_read(&tsc.count_start) != num_booting_cpus()) - cpu_relax(); - - rdtscll(tsc.values[smp_processor_id()]); - if (i == NR_LOOPS-1) - write_tsc(0, 0); - - atomic_inc(&tsc.count_stop); - while (atomic_read(&tsc.count_stop) != num_booting_cpus()) - cpu_relax(); - } -} -#undef NR_LOOPS - extern void calibrate_delay(void); static atomic_t init_deasserted; @@ -432,20 +281,12 @@ static void __devinit smp_callin(void) /* * Save our processor parameters */ - smp_store_cpu_info(cpuid); - - disable_APIC_timer(); + smp_store_cpu_info(cpuid); /* * Allow the master to continue. */ cpu_set(cpuid, cpu_callin_map); - - /* - * Synchronize the TSC with the BP - */ - if (cpu_has_tsc && cpu_khz && !tsc_sync_disabled) - synchronize_tsc_ap(); } static int cpucount; @@ -545,13 +386,17 @@ static void __devinit start_secondary(vo smp_callin(); while (!cpu_isset(smp_processor_id(), smp_commenced_mask)) rep_nop(); + /* + * Check TSC synchronization with the BP: + */ + check_tsc_sync_target(); + setup_secondary_APIC_clock(); if (nmi_watchdog == NMI_IO_APIC) { disable_8259A_irq(0); enable_NMI_through_LVT0(NULL); enable_8259A_irq(0); } - enable_APIC_timer(); /* * low-memory mappings have been cleared, flush them from * the local TLBs too. @@ -735,7 +580,7 @@ wakeup_secondary_cpu(int logical_apicid, /* * Due to the Pentium erratum 3AP. */ - maxlvt = get_maxlvt(); + maxlvt = lapic_get_maxlvt(); if (maxlvt > 3) { apic_read_around(APIC_SPIV); apic_write(APIC_ESR, 0); @@ -825,7 +670,7 @@ wakeup_secondary_cpu(int phys_apicid, un */ Dprintk("#startup loops: %d.\n", num_starts); - maxlvt = get_maxlvt(); + maxlvt = lapic_get_maxlvt(); for (j = 1; j <= num_starts; j++) { Dprintk("Sending STARTUP #%d.\n",j); @@ -1091,8 +936,6 @@ static int __cpuinit __smp_prepare_cpu(i info.cpu = cpu; INIT_WORK(&task, do_warm_boot_cpu, &info); - tsc_sync_disabled = 1; - /* init low mem mapping */ clone_pgd_range(swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, KERNEL_PGD_PTRS); @@ -1100,7 +943,6 @@ static int __cpuinit __smp_prepare_cpu(i schedule_work(&task); wait_for_completion(&done); - tsc_sync_disabled = 0; zap_low_mappings(); ret = 0; exit: @@ -1316,12 +1158,6 @@ static void __init smp_boot_cpus(unsigne smpboot_setup_io_apic(); setup_boot_APIC_clock(); - - /* - * Synchronize the TSC with the AP - */ - if (cpu_has_tsc && cpucount && cpu_khz) - synchronize_tsc_bp(); } /* These are wrappers to interface to the new boot process. Someone @@ -1456,9 +1292,16 @@ int __devinit __cpu_up(unsigned int cpu) } local_irq_enable(); + per_cpu(cpu_state, cpu) = CPU_UP_PREPARE; /* Unleash the CPU! */ cpu_set(cpu, smp_commenced_mask); + + /* + * Check TSC synchronization with the AP: + */ + check_tsc_sync_source(cpu); + while (!cpu_isset(cpu, cpu_online_map)) cpu_relax(); return 0; Index: linux/arch/i386/kernel/time.c =================================================================== --- linux.orig/arch/i386/kernel/time.c +++ linux/arch/i386/kernel/time.c @@ -128,7 +128,7 @@ static int set_rtc_mmss(unsigned long no int timer_ack; -unsigned long profile_pc(struct pt_regs *regs) +unsigned long notrace profile_pc(struct pt_regs *regs) { unsigned long pc = instruction_pointer(regs); @@ -163,15 +163,6 @@ EXPORT_SYMBOL(profile_pc); */ irqreturn_t timer_interrupt(int irq, void *dev_id) { - /* - * Here we are in the timer irq handler. We just have irqs locally - * disabled but we don't know if the timer_bh is running on the other - * CPU. We need to avoid to SMP race with it. NOTE: we don' t need - * the irq version of write_lock because as just said we have irq - * locally disabled. -arca - */ - write_seqlock(&xtime_lock); - #ifdef CONFIG_X86_IO_APIC if (timer_ack) { /* @@ -190,7 +181,6 @@ irqreturn_t timer_interrupt(int irq, voi do_timer_interrupt_hook(); - if (MCA_bus) { /* The PS/2 uses level-triggered interrupts. You can't turn them off, nor would you want to (any attempt to @@ -205,18 +195,11 @@ irqreturn_t timer_interrupt(int irq, voi outb_p( irq_v|0x80, 0x61 ); /* reset the IRQ */ } - write_sequnlock(&xtime_lock); - -#ifdef CONFIG_X86_LOCAL_APIC - if (using_apic_timer) - smp_send_timer_broadcast_ipi(); -#endif - return IRQ_HANDLED; } /* not static: needed by APM */ -unsigned long get_cmos_time(void) +unsigned long read_persistent_clock(void) { unsigned long retval; unsigned long flags; @@ -232,7 +215,6 @@ unsigned long get_cmos_time(void) return retval; } -EXPORT_SYMBOL(get_cmos_time); static void sync_cmos_clock(unsigned long dummy); @@ -283,89 +265,11 @@ void notify_arch_cmos_timer(void) mod_timer(&sync_cmos_timer, jiffies + 1); } -static long clock_cmos_diff; -static unsigned long sleep_start; - -static int timer_suspend(struct sys_device *dev, pm_message_t state) -{ - /* - * Estimate time zone so that set_time can update the clock - */ - unsigned long ctime = get_cmos_time(); - - clock_cmos_diff = -ctime; - clock_cmos_diff += get_seconds(); - sleep_start = ctime; - return 0; -} - -static int timer_resume(struct sys_device *dev) -{ - unsigned long flags; - unsigned long sec; - unsigned long ctime = get_cmos_time(); - long sleep_length = (ctime - sleep_start) * HZ; - struct timespec ts; - - if (sleep_length < 0) { - printk(KERN_WARNING "CMOS clock skew detected in timer resume!\n"); - /* The time after the resume must not be earlier than the time - * before the suspend or some nasty things will happen - */ - sleep_length = 0; - ctime = sleep_start; - } -#ifdef CONFIG_HPET_TIMER - if (is_hpet_enabled()) - hpet_reenable(); -#endif - setup_pit_timer(); - - sec = ctime + clock_cmos_diff; - ts.tv_sec = sec; - ts.tv_nsec = 0; - do_settimeofday(&ts); - write_seqlock_irqsave(&xtime_lock, flags); - jiffies_64 += sleep_length; - write_sequnlock_irqrestore(&xtime_lock, flags); - touch_softlockup_watchdog(); - return 0; -} - -static struct sysdev_class timer_sysclass = { - .resume = timer_resume, - .suspend = timer_suspend, - set_kset_name("timer"), -}; - - -/* XXX this driverfs stuff should probably go elsewhere later -john */ -static struct sys_device device_timer = { - .id = 0, - .cls = &timer_sysclass, -}; - -static int time_init_device(void) -{ - int error = sysdev_class_register(&timer_sysclass); - if (!error) - error = sysdev_register(&device_timer); - return error; -} - -device_initcall(time_init_device); - #ifdef CONFIG_HPET_TIMER extern void (*late_time_init)(void); /* Duplicate of time_init() below, with hpet_enable part added */ static void __init hpet_time_init(void) { - struct timespec ts; - ts.tv_sec = get_cmos_time(); - ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ); - - do_settimeofday(&ts); - if ((hpet_enable() >= 0) && hpet_use_timer) { printk("Using HPET for base-timer\n"); } @@ -376,7 +280,6 @@ static void __init hpet_time_init(void) void __init time_init(void) { - struct timespec ts; #ifdef CONFIG_HPET_TIMER if (is_hpet_capable()) { /* @@ -387,10 +290,5 @@ void __init time_init(void) return; } #endif - ts.tv_sec = get_cmos_time(); - ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ); - - do_settimeofday(&ts); - time_init_hook(); } Index: linux/arch/i386/kernel/time_hpet.c =================================================================== --- linux.orig/arch/i386/kernel/time_hpet.c +++ linux/arch/i386/kernel/time_hpet.c @@ -28,7 +28,10 @@ unsigned long hpet_address; /* hpet mem int hpet_use_timer; static int use_hpet; /* can be used for runtime check of hpet */ -static int boot_hpet_disable; /* boottime override for HPET timer */ +/* + * TODO: disable for now: + */ +static int boot_hpet_disable = 1; /* boottime override for HPET timer */ static void __iomem * hpet_virt_address; /* hpet kernel virtual address */ #define FSEC_TO_USEC (1000000000UL) @@ -43,23 +46,6 @@ static void hpet_writel(unsigned long d, writel(d, hpet_virt_address + a); } -#ifdef CONFIG_X86_LOCAL_APIC -/* - * HPET counters dont wrap around on every tick. They just change the - * comparator value and continue. Next tick can be caught by checking - * for a change in the comparator value. Used in apic.c. - */ -static void __devinit wait_hpet_tick(void) -{ - unsigned int start_cmp_val, end_cmp_val; - - start_cmp_val = hpet_readl(HPET_T0_CMP); - do { - end_cmp_val = hpet_readl(HPET_T0_CMP); - } while (start_cmp_val == end_cmp_val); -} -#endif - static int hpet_timer_stop_set_go(unsigned long tick) { unsigned int cfg; @@ -204,11 +190,6 @@ int __init hpet_enable(void) hpet_alloc(&hd); } #endif - -#ifdef CONFIG_X86_LOCAL_APIC - if (hpet_use_timer) - wait_timer_tick = wait_hpet_tick; -#endif return 0; } Index: linux/arch/i386/kernel/traps.c =================================================================== --- linux.orig/arch/i386/kernel/traps.c +++ linux/arch/i386/kernel/traps.c @@ -100,18 +100,18 @@ static int call_trace = 1; #else #define call_trace (-1) #endif -ATOMIC_NOTIFIER_HEAD(i386die_chain); +RAW_NOTIFIER_HEAD(i386die_chain); int register_die_notifier(struct notifier_block *nb) { vmalloc_sync_all(); - return atomic_notifier_chain_register(&i386die_chain, nb); + return raw_notifier_chain_register(&i386die_chain, nb); } EXPORT_SYMBOL(register_die_notifier); /* used modular by kdb */ int unregister_die_notifier(struct notifier_block *nb) { - return atomic_notifier_chain_unregister(&i386die_chain, nb); + return raw_notifier_chain_unregister(&i386die_chain, nb); } EXPORT_SYMBOL(unregister_die_notifier); /* used modular by kdb */ @@ -322,6 +322,7 @@ static void show_stack_log_lvl(struct ta } printk("\n%sCall Trace:\n", log_lvl); show_trace_log_lvl(task, regs, esp, log_lvl); + debug_show_held_locks(task); } void show_stack(struct task_struct *task, unsigned long *esp) @@ -342,6 +343,12 @@ void dump_stack(void) EXPORT_SYMBOL(dump_stack); +#if defined(CONFIG_DEBUG_STACKOVERFLOW) && defined(CONFIG_EVENT_TRACE) +extern unsigned long worst_stack_left; +#else +# define worst_stack_left -1L +#endif + void show_registers(struct pt_regs *regs) { int i; @@ -369,8 +376,8 @@ void show_registers(struct pt_regs *regs regs->eax, regs->ebx, regs->ecx, regs->edx); printk(KERN_EMERG "esi: %08lx edi: %08lx ebp: %08lx esp: %08lx\n", regs->esi, regs->edi, regs->ebp, esp); - printk(KERN_EMERG "ds: %04x es: %04x ss: %04x\n", - regs->xds & 0xffff, regs->xes & 0xffff, ss); + printk(KERN_EMERG "ds: %04x es: %04x ss: %04x preempt: %08x\n", + regs->xds & 0xffff, regs->xes & 0xffff, ss, preempt_count()); printk(KERN_EMERG "Process %.*s (pid: %d, ti=%p task=%p task.ti=%p)", TASK_COMM_LEN, current->comm, current->pid, current_thread_info(), current, current->thread_info); @@ -448,11 +455,11 @@ static void handle_BUG(struct pt_regs *r void die(const char * str, struct pt_regs * regs, long err) { static struct { - spinlock_t lock; + raw_spinlock_t lock; u32 lock_owner; int lock_owner_depth; } die = { - .lock = SPIN_LOCK_UNLOCKED, + .lock = RAW_SPIN_LOCK_UNLOCKED(die.lock), .lock_owner = -1, .lock_owner_depth = 0 }; @@ -559,6 +566,11 @@ static void __kprobes do_trap(int trapnr if (!user_mode(regs)) goto kernel_trap; +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif + trap_signal: { if (info) force_sig_info(signr, info, tsk); @@ -578,6 +590,7 @@ static void __kprobes do_trap(int trapnr if (ret) goto trap_signal; return; } + print_traces(tsk); } #define DO_ERROR(trapnr, signr, str, name) \ @@ -786,10 +799,11 @@ void __kprobes die_nmi(struct pt_regs *r crash_kexec(regs); } + nmi_exit(); do_exit(SIGSEGV); } -static __kprobes void default_do_nmi(struct pt_regs * regs) +static notrace __kprobes void default_do_nmi(struct pt_regs * regs) { unsigned char reason = 0; @@ -827,11 +841,12 @@ static __kprobes void default_do_nmi(str reassert_nmi(); } -fastcall __kprobes void do_nmi(struct pt_regs * regs, long error_code) +fastcall notrace __kprobes void do_nmi(struct pt_regs * regs, long error_code) { int cpu; nmi_enter(); + nmi_trace((unsigned long)do_nmi, regs->eip, regs->eflags); cpu = smp_processor_id(); Index: linux/arch/i386/kernel/tsc.c =================================================================== --- linux.orig/arch/i386/kernel/tsc.c +++ linux/arch/i386/kernel/tsc.c @@ -10,7 +10,9 @@ #include #include #include +#include +#include #include #include #include @@ -333,6 +335,16 @@ static cycle_t read_tsc(void) return ret; } + +static cycle_t __vsyscall_fn vread_tsc(void) +{ + cycle_t ret; + + rdtscll(ret); + + return ret; +} + static struct clocksource clocksource_tsc = { .name = "tsc", .rating = 300, @@ -342,6 +354,7 @@ static struct clocksource clocksource_ts .shift = 22, .update_callback = tsc_update_callback, .is_continuous = 1, + .vread = vread_tsc, }; static int tsc_update_callback(void) @@ -389,42 +402,51 @@ static struct dmi_system_id __initdata b #define TSC_FREQ_CHECK_INTERVAL (10*MSEC_PER_SEC) /* 10sec in MS */ static struct timer_list verify_tsc_freq_timer; +static unsigned long pm_multiplier; /* XXX - Probably should add locking */ static void verify_tsc_freq(unsigned long unused) { static u64 last_tsc; - static unsigned long last_jiffies; + static unsigned long last_jiffies, last_pm; u64 now_tsc, interval_tsc; - unsigned long now_jiffies, interval_jiffies; - + unsigned long now_jiffies, interval_jiffies, pm, pm_delta; if (check_tsc_unstable()) return; rdtscll(now_tsc); now_jiffies = jiffies; + pm = acpi_pm_read_early(); - if (!last_jiffies) { + if (!last_jiffies) goto out; - } interval_jiffies = now_jiffies - last_jiffies; - interval_tsc = now_tsc - last_tsc; - interval_tsc *= HZ; - do_div(interval_tsc, cpu_khz*1000); + + if (pm == last_pm) { + interval_tsc = now_tsc - last_tsc; + interval_tsc *= HZ; + do_div(interval_tsc, cpu_khz*1000); + } else { + if (pm < last_pm) + pm += ACPI_PM_OVRRUN; + pm_delta = pm - last_pm; + interval_tsc = (((u64) pm_delta) * pm_multiplier) >> 22; + do_div(interval_tsc, TICK_NSEC); + } if (interval_tsc < (interval_jiffies * 3 / 4)) { printk("TSC appears to be running slowly. " - "Marking it as unstable\n"); + "Marking it as unstable\n"); mark_tsc_unstable(); return; } - out: last_tsc = now_tsc; last_jiffies = now_jiffies; + last_pm = pm; /* set us up to go off on the next interval: */ mod_timer(&verify_tsc_freq_timer, jiffies + msecs_to_jiffies(TSC_FREQ_CHECK_INTERVAL)); @@ -434,8 +456,10 @@ out: * Make an educated guess if the TSC is trustworthy and synchronized * over all CPUs. */ -static __init int unsynchronized_tsc(void) +__cpuinit int unsynchronized_tsc(void) { + if (!cpu_has_tsc || tsc_unstable) + return 1; /* * Intel systems are normally all synchronized. * Exceptions must mark TSC as unstable: @@ -463,6 +487,21 @@ static int __init init_tsc_clocksource(v if (check_tsc_unstable()) clocksource_tsc.rating = 0; + /* + * Mark TSC unsuitable for high resolution timers, when + * pm_timer is not available as a fallback. TSC has so many + * pitfalls: frequency changes, stop in idle ... When we + * switch to high resolution mode we can not longer detect a + * firmware caused frequency change, as the emulated tick uses + * TSC as reference. This results in a circular dependency. + * Switch only to high resolution mode, if pm_timer or such is + * available. + */ + if (!pmtmr_ioport || !clocksource_tsc.rating) + clocksource_tsc.is_continuous = 0; + + pm_multiplier = clocksource_hz2mult(PMTMR_TICKS_PER_SEC, 22); + init_timer(&verify_tsc_freq_timer); verify_tsc_freq_timer.function = verify_tsc_freq; verify_tsc_freq_timer.expires = Index: linux/arch/i386/kernel/tsc_sync.c =================================================================== --- /dev/null +++ linux/arch/i386/kernel/tsc_sync.c @@ -0,0 +1 @@ +#include "../../x86_64/kernel/tsc_sync.c" Index: linux/arch/i386/kernel/vm86.c =================================================================== --- linux.orig/arch/i386/kernel/vm86.c +++ linux/arch/i386/kernel/vm86.c @@ -109,6 +109,7 @@ struct pt_regs * fastcall save_v86_state local_irq_enable(); if (!current->thread.vm86_info) { + local_irq_disable(); printk("no vm86_info: BAD\n"); do_exit(SIGSEGV); } Index: linux/arch/i386/kernel/vmlinux.lds.S =================================================================== --- linux.orig/arch/i386/kernel/vmlinux.lds.S +++ linux/arch/i386/kernel/vmlinux.lds.S @@ -8,6 +8,7 @@ #include #include #include +#include OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) @@ -78,6 +79,51 @@ SECTIONS .data.read_mostly : AT(ADDR(.data.read_mostly) - LOAD_OFFSET) { *(.data.read_mostly) } _edata = .; /* End of data section */ +/* VSYSCALL_GTOD data */ +#ifdef CONFIG_GENERIC_TIME_VSYSCALL +#undef VSYSCALL_ADDR +#define VSYSCALL_ADDR VSYSCALL_GTOD_START +#define VSYSCALL_PHYS_ADDR ((LOADADDR(.data.read_mostly) + SIZEOF(.data.read_mostly) + 4095) & ~(4095)) +#define VSYSCALL_VIRT_ADDR ((ADDR(.data.read_mostly) + SIZEOF(.data.read_mostly) + 4095) & ~(4095)) + +#define VLOAD_OFFSET (VSYSCALL_ADDR - VSYSCALL_PHYS_ADDR) +#define VLOAD(x) (ADDR(x) - VLOAD_OFFSET) + +#define VVIRT_OFFSET (VSYSCALL_ADDR - VSYSCALL_VIRT_ADDR) +#define VVIRT(x) (ADDR(x) - VVIRT_OFFSET) + + . = VSYSCALL_ADDR; + .vsyscall_0 : AT(VSYSCALL_PHYS_ADDR) { *(.vsyscall_0) } + __vsyscall_0 = VSYSCALL_VIRT_ADDR; + + .vsyscall_fn : AT(VLOAD(.vsyscall_fn)) { *(.vsyscall_fn) } + .vsyscall_data : AT(VLOAD(.vsyscall_data)) { *(.vsyscall_data) } + + . = ALIGN(32); + .vsyscall_gtod_data : AT (VLOAD(.vsyscall_gtod_data)) { *(.vsyscall_gtod_data) } + vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data); + + . = ALIGN(32); + .vsyscall_gtod_lock : AT (VLOAD(.vsyscall_gtod_lock)) { *(.vsyscall_gtod_lock) } + vsyscall_gtod_lock = VVIRT(.vsyscall_gtod_lock); + + .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) } + .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) } + .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) } + + . = VSYSCALL_VIRT_ADDR + 4096; + +#undef VSYSCALL_ADDR +#undef VSYSCALL_PHYS_ADDR +#undef VSYSCALL_VIRT_ADDR +#undef VLOAD_OFFSET +#undef VLOAD +#undef VVIRT_OFFSET +#undef VVIRT + +#endif +/* END of VSYSCALL_GTOD data*/ + #ifdef CONFIG_STACK_UNWIND . = ALIGN(4); .eh_frame : AT(ADDR(.eh_frame) - LOAD_OFFSET) { Index: linux/arch/i386/kernel/vsyscall-gtod.c =================================================================== --- /dev/null +++ linux/arch/i386/kernel/vsyscall-gtod.c @@ -0,0 +1,179 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct vsyscall_gtod_data_t { + struct timeval wall_time_tv; + struct timezone sys_tz; + struct clocksource clock; +}; + +struct vsyscall_gtod_data_t vsyscall_gtod_data; +struct vsyscall_gtod_data_t __vsyscall_gtod_data __section_vsyscall_gtod_data; + +seqlock_t vsyscall_gtod_lock = SEQLOCK_UNLOCKED; +seqlock_t __vsyscall_gtod_lock __section_vsyscall_gtod_lock = SEQLOCK_UNLOCKED; + +int errno; +static inline _syscall2(int,gettimeofday,struct timeval *,tv,struct timezone *,tz); + +static int vsyscall_mapped = 0; /* flag variable for remap_vsyscall() */ +extern struct timezone sys_tz; + +static inline void do_vgettimeofday(struct timeval* tv) +{ + cycle_t now, cycle_delta; + s64 nsec_delta; + + if (!__vsyscall_gtod_data.clock.vread) { + gettimeofday(tv, NULL); + return; + } + + /* read the clocksource and calc cycle_delta */ + now = __vsyscall_gtod_data.clock.vread(); + cycle_delta = (now - __vsyscall_gtod_data.clock.cycle_last) + & __vsyscall_gtod_data.clock.mask; + + /* convert cycles to nsecs */ + nsec_delta = cycle_delta * __vsyscall_gtod_data.clock.mult; + nsec_delta = nsec_delta >> __vsyscall_gtod_data.clock.shift; + + /* add nsec offset to wall_time_tv */ + *tv = __vsyscall_gtod_data.wall_time_tv; + do_div(nsec_delta, NSEC_PER_USEC); + while (nsec_delta > NSEC_PER_SEC) { + tv->tv_sec += 1; + nsec_delta -= NSEC_PER_SEC; + } + tv->tv_usec += ((unsigned long)nsec_delta)/1000; +} + +static inline void do_get_tz(struct timezone *tz) +{ + *tz = __vsyscall_gtod_data.sys_tz; +} + +static int __vsyscall(0) asmlinkage vgettimeofday(struct timeval *tv, struct timezone *tz) +{ + unsigned long seq; + do { + seq = read_seqbegin(&__vsyscall_gtod_lock); + + if (tv) + do_vgettimeofday(tv); + if (tz) + do_get_tz(tz); + + } while (read_seqretry(&__vsyscall_gtod_lock, seq)); + + return 0; +} + +static time_t __vsyscall(1) asmlinkage vtime(time_t * t) +{ + struct timeval tv; + vgettimeofday(&tv,NULL); + if (t) + *t = tv.tv_sec; + return tv.tv_sec; +} + +struct clocksource* curr_clock; + +void update_vsyscall(struct timespec *wall_time, + struct clocksource* clock) +{ + unsigned long flags; + + write_seqlock_irqsave(&vsyscall_gtod_lock, flags); + + /* XXX - hackitty hack hack. this is terrible! */ + if (curr_clock != clock) { + curr_clock = clock; + } + + /* save off wall time as timeval */ + vsyscall_gtod_data.wall_time_tv.tv_sec = wall_time->tv_sec; + vsyscall_gtod_data.wall_time_tv.tv_usec = wall_time->tv_nsec/1000; + + /* copy current clocksource */ + vsyscall_gtod_data.clock = *clock; + + /* save off current timezone */ + vsyscall_gtod_data.sys_tz = sys_tz; + + write_sequnlock_irqrestore(&vsyscall_gtod_lock, flags); + +} +extern char __vsyscall_0; + +static void __init map_vsyscall(void) +{ + unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - PAGE_OFFSET; + + /* Initially we map the VSYSCALL page w/ PAGE_KERNEL permissions to + * keep the alternate_instruction code from bombing out when it + * changes the seq_lock memory barriers in vgettimeofday() + */ + __set_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE, physaddr_page0, PAGE_KERNEL); +} + +static int __init remap_vsyscall(void) +{ + unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - PAGE_OFFSET; + + if (!vsyscall_mapped) + return 0; + + /* Remap the VSYSCALL page w/ PAGE_KERNEL_VSYSCALL permissions + * after the alternate_instruction code has run + */ + clear_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE); + __set_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL); + + return 0; +} + +int __init vsyscall_init(void) +{ + printk("VSYSCALL: consistency checks..."); + if ((unsigned long) &vgettimeofday != VSYSCALL_ADDR(__NR_vgettimeofday)) { + printk("vgettimeofday link addr broken\n"); + printk("VSYSCALL: vsyscall_init failed!\n"); + return -EFAULT; + } + if ((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime)) { + printk("vtime link addr broken\n"); + printk("VSYSCALL: vsyscall_init failed!\n"); + return -EFAULT; + } + if (VSYSCALL_ADDR(0) != __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE)) { + printk("fixmap first vsyscall 0x%lx should be 0x%x\n", + __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE), + VSYSCALL_ADDR(0)); + printk("VSYSCALL: vsyscall_init failed!\n"); + return -EFAULT; + } + + + printk("passed...mapping..."); + map_vsyscall(); + printk("done.\n"); + vsyscall_mapped = 1; + printk("VSYSCALL: fixmap virt addr: 0x%lx\n", + __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE)); + + return 0; +} +__initcall(remap_vsyscall); Index: linux/arch/i386/lib/semaphore.S =================================================================== --- linux.orig/arch/i386/lib/semaphore.S +++ linux/arch/i386/lib/semaphore.S @@ -30,7 +30,7 @@ * value or just clobbered.. */ .section .sched.text -ENTRY(__down_failed) +ENTRY(__compat_down_failed) CFI_STARTPROC FRAME pushl %edx @@ -39,7 +39,7 @@ ENTRY(__down_failed) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __down + call __compat_down popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -49,9 +49,9 @@ ENTRY(__down_failed) ENDFRAME ret CFI_ENDPROC - END(__down_failed) + END(__compat_down_failed) -ENTRY(__down_failed_interruptible) +ENTRY(__compat_down_failed_interruptible) CFI_STARTPROC FRAME pushl %edx @@ -60,7 +60,7 @@ ENTRY(__down_failed_interruptible) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __down_interruptible + call __compat_down_interruptible popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -70,9 +70,9 @@ ENTRY(__down_failed_interruptible) ENDFRAME ret CFI_ENDPROC - END(__down_failed_interruptible) + END(__compat_down_failed_interruptible) -ENTRY(__down_failed_trylock) +ENTRY(__compat_down_failed_trylock) CFI_STARTPROC FRAME pushl %edx @@ -81,7 +81,7 @@ ENTRY(__down_failed_trylock) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __down_trylock + call __compat_down_trylock popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -91,9 +91,9 @@ ENTRY(__down_failed_trylock) ENDFRAME ret CFI_ENDPROC - END(__down_failed_trylock) + END(__compat_down_failed_trylock) -ENTRY(__up_wakeup) +ENTRY(__compat_up_wakeup) CFI_STARTPROC FRAME pushl %edx @@ -102,7 +102,7 @@ ENTRY(__up_wakeup) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __up + call __compat_up popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -112,7 +112,7 @@ ENTRY(__up_wakeup) ENDFRAME ret CFI_ENDPROC - END(__up_wakeup) + END(__compat_up_wakeup) /* * rw spinlock fallbacks Index: linux/arch/i386/mach-default/setup.c =================================================================== --- linux.orig/arch/i386/mach-default/setup.c +++ linux/arch/i386/mach-default/setup.c @@ -35,7 +35,7 @@ void __init pre_intr_init_hook(void) /* * IRQ2 is cascade interrupt to second interrupt controller */ -static struct irqaction irq2 = { no_action, 0, CPU_MASK_NONE, "cascade", NULL, NULL}; +static struct irqaction irq2 = { no_action, IRQF_NODELAY, CPU_MASK_NONE, "cascade", NULL, NULL}; /** * intr_init_hook - post gate setup interrupt initialisation @@ -79,7 +79,7 @@ void __init trap_init_hook(void) { } -static struct irqaction irq0 = { timer_interrupt, IRQF_DISABLED, CPU_MASK_NONE, "timer", NULL, NULL}; +static struct irqaction irq0 = { timer_interrupt, IRQF_DISABLED | IRQF_NODELAY, CPU_MASK_NONE, "timer", NULL, NULL}; /** * time_init_hook - do any specific initialisations for the system timer. Index: linux/arch/i386/mach-es7000/es7000plat.c =================================================================== --- linux.orig/arch/i386/mach-es7000/es7000plat.c +++ linux/arch/i386/mach-es7000/es7000plat.c @@ -164,7 +164,7 @@ find_unisys_acpi_oem_table(unsigned long unsigned long rsdp_phys = 0; struct acpi_table_header *header = NULL; int i; - struct acpi_table_sdt sdt; + struct acpi_table_sdt sdt = { } /* shut up gcc */; rsdp_phys = acpi_find_rsdp(); rsdp = __va(rsdp_phys); Index: linux/arch/i386/mach-visws/setup.c =================================================================== --- linux.orig/arch/i386/mach-visws/setup.c +++ linux/arch/i386/mach-visws/setup.c @@ -115,7 +115,7 @@ void __init pre_setup_arch_hook() static struct irqaction irq0 = { .handler = timer_interrupt, - .flags = IRQF_DISABLED, + .flags = IRQF_DISABLED | IRQF_NODELAY, .name = "timer", }; Index: linux/arch/i386/mach-visws/visws_apic.c =================================================================== --- linux.orig/arch/i386/mach-visws/visws_apic.c +++ linux/arch/i386/mach-visws/visws_apic.c @@ -258,11 +258,13 @@ out_unlock: static struct irqaction master_action = { .handler = piix4_master_intr, .name = "PIIX4-8259", + .flags = IRQF_NODELAY, }; static struct irqaction cascade_action = { .handler = no_action, .name = "cascade", + .flags = IRQF_NODELAY, }; Index: linux/arch/i386/mach-voyager/setup.c =================================================================== --- linux.orig/arch/i386/mach-voyager/setup.c +++ linux/arch/i386/mach-voyager/setup.c @@ -18,7 +18,7 @@ void __init pre_intr_init_hook(void) /* * IRQ2 is cascade interrupt to second interrupt controller */ -static struct irqaction irq2 = { no_action, 0, CPU_MASK_NONE, "cascade", NULL, NULL}; +static struct irqaction irq2 = { no_action, IRQF_NODELAY, CPU_MASK_NONE, "cascade", NULL, NULL}; void __init intr_init_hook(void) { @@ -40,7 +40,7 @@ void __init trap_init_hook(void) { } -static struct irqaction irq0 = { timer_interrupt, IRQF_DISABLED, CPU_MASK_NONE, "timer", NULL, NULL}; +static struct irqaction irq0 = { timer_interrupt, IRQF_DISABLED | IRQF_NODELAY, CPU_MASK_NONE, "timer", NULL, NULL}; void __init time_init_hook(void) { Index: linux/arch/i386/mm/fault.c =================================================================== --- linux.orig/arch/i386/mm/fault.c +++ linux/arch/i386/mm/fault.c @@ -68,6 +68,9 @@ void bust_spinlocks(int yes) int loglevel_save = console_loglevel; if (yes) { + stop_trace(); + user_trace_stop(); + zap_rt_locks(); oops_in_progress = 1; return; } @@ -320,8 +323,8 @@ static inline int vmalloc_fault(unsigned * bit 3 == 1 means use of reserved bit detected * bit 4 == 1 means fault was an instruction fetch */ -fastcall void __kprobes do_page_fault(struct pt_regs *regs, - unsigned long error_code) +fastcall notrace void __kprobes do_page_fault(struct pt_regs *regs, + unsigned long error_code) { struct task_struct *tsk; struct mm_struct *mm; @@ -332,6 +335,7 @@ fastcall void __kprobes do_page_fault(st /* get the address */ address = read_cr2(); + trace_special(regs->eip, error_code, address); tsk = current; Index: linux/arch/i386/mm/highmem.c =================================================================== --- linux.orig/arch/i386/mm/highmem.c +++ linux/arch/i386/mm/highmem.c @@ -18,6 +18,26 @@ void kunmap(struct page *page) kunmap_high(page); } +void kunmap_virt(void *ptr) +{ + struct page *page; + + if ((unsigned long)ptr < PKMAP_ADDR(0)) + return; + page = pte_page(pkmap_page_table[PKMAP_NR((unsigned long)ptr)]); + kunmap(page); +} + +struct page *kmap_to_page(void *ptr) +{ + struct page *page; + + if ((unsigned long)ptr < PKMAP_ADDR(0)) + return virt_to_page(ptr); + page = pte_page(pkmap_page_table[PKMAP_NR((unsigned long)ptr)]); + return page; +} + /* * kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because * no global lock is needed and because the kmap code must perform a global TLB @@ -26,7 +46,7 @@ void kunmap(struct page *page) * However when holding an atomic kmap is is not legal to sleep, so atomic * kmaps are appropriate for short, tight code paths only. */ -void *kmap_atomic(struct page *page, enum km_type type) +void *__kmap_atomic(struct page *page, enum km_type type) { enum fixed_addresses idx; unsigned long vaddr; @@ -45,7 +65,7 @@ void *kmap_atomic(struct page *page, enu return (void*) vaddr; } -void kunmap_atomic(void *kvaddr, enum km_type type) +void __kunmap_atomic(void *kvaddr, enum km_type type) { unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id(); @@ -75,7 +95,7 @@ void kunmap_atomic(void *kvaddr, enum km /* This is the same as kmap_atomic() but can map memory that doesn't * have a struct page associated with it. */ -void *kmap_atomic_pfn(unsigned long pfn, enum km_type type) +void *__kmap_atomic_pfn(unsigned long pfn, enum km_type type) { enum fixed_addresses idx; unsigned long vaddr; @@ -89,7 +109,7 @@ void *kmap_atomic_pfn(unsigned long pfn, return (void*) vaddr; } -struct page *kmap_atomic_to_page(void *ptr) +struct page *__kmap_atomic_to_page(void *ptr) { unsigned long idx, vaddr = (unsigned long)ptr; pte_t *pte; @@ -104,6 +124,7 @@ struct page *kmap_atomic_to_page(void *p EXPORT_SYMBOL(kmap); EXPORT_SYMBOL(kunmap); -EXPORT_SYMBOL(kmap_atomic); -EXPORT_SYMBOL(kunmap_atomic); -EXPORT_SYMBOL(kmap_atomic_to_page); +EXPORT_SYMBOL(kunmap_virt); +EXPORT_SYMBOL(__kmap_atomic); +EXPORT_SYMBOL(__kunmap_atomic); +EXPORT_SYMBOL(__kmap_atomic_to_page); Index: linux/arch/i386/mm/init.c =================================================================== --- linux.orig/arch/i386/mm/init.c +++ linux/arch/i386/mm/init.c @@ -45,7 +45,7 @@ unsigned int __VMALLOC_RESERVE = 128 << 20; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long highstart_pfn, highend_pfn; static int noinline do_test_wp_bit(void); @@ -194,7 +194,7 @@ static inline int page_kills_ppro(unsign extern int is_available_memory(efi_memory_desc_t *); -int page_is_ram(unsigned long pagenr) +int notrace page_is_ram(unsigned long pagenr) { int i; unsigned long addr, end; Index: linux/arch/i386/mm/pgtable.c =================================================================== --- linux.orig/arch/i386/mm/pgtable.c +++ linux/arch/i386/mm/pgtable.c @@ -210,7 +210,7 @@ void pmd_ctor(void *pmd, kmem_cache_t *c * recommendations and having no core impact whatsoever. * -- wli */ -DEFINE_SPINLOCK(pgd_lock); +DEFINE_RAW_SPINLOCK(pgd_lock); struct page *pgd_list; static inline void pgd_list_add(pgd_t *pgd) Index: linux/arch/i386/oprofile/Kconfig =================================================================== --- linux.orig/arch/i386/oprofile/Kconfig +++ linux/arch/i386/oprofile/Kconfig @@ -15,3 +15,6 @@ config OPROFILE If unsure, say N. +config PROFILE_NMI + bool + default y Index: linux/arch/i386/pci/Makefile =================================================================== --- linux.orig/arch/i386/pci/Makefile +++ linux/arch/i386/pci/Makefile @@ -4,8 +4,9 @@ obj-$(CONFIG_PCI_BIOS) += pcbios.o obj-$(CONFIG_PCI_MMCONFIG) += mmconfig.o direct.o obj-$(CONFIG_PCI_DIRECT) += direct.o +obj-$(CONFIG_ACPI) += acpi.o + pci-y := fixup.o -pci-$(CONFIG_ACPI) += acpi.o pci-y += legacy.o irq.o pci-$(CONFIG_X86_VISWS) := visws.o fixup.o Index: linux/arch/i386/pci/common.c =================================================================== --- linux.orig/arch/i386/pci/common.c +++ linux/arch/i386/pci/common.c @@ -52,7 +52,7 @@ int pcibios_scanned; * This interrupt-safe spinlock protects all accesses to PCI * configuration space. */ -DEFINE_SPINLOCK(pci_config_lock); +DEFINE_RAW_SPINLOCK(pci_config_lock); /* * Several buggy motherboards address only 16 devices and mirror Index: linux/arch/i386/pci/direct.c =================================================================== --- linux.orig/arch/i386/pci/direct.c +++ linux/arch/i386/pci/direct.c @@ -220,16 +220,23 @@ static int __init pci_check_type1(void) unsigned int tmp; int works = 0; - local_irq_save(flags); + spin_lock_irqsave(&pci_config_lock, flags); outb(0x01, 0xCFB); tmp = inl(0xCF8); outl(0x80000000, 0xCF8); - if (inl(0xCF8) == 0x80000000 && pci_sanity_check(&pci_direct_conf1)) { - works = 1; + + if (inl(0xCF8) == 0x80000000) { + spin_unlock_irqrestore(&pci_config_lock, flags); + + if (pci_sanity_check(&pci_direct_conf1)) + works = 1; + + spin_lock_irqsave(&pci_config_lock, flags); } outl(tmp, 0xCF8); - local_irq_restore(flags); + + spin_unlock_irqrestore(&pci_config_lock, flags); return works; } @@ -239,17 +246,19 @@ static int __init pci_check_type2(void) unsigned long flags; int works = 0; - local_irq_save(flags); + spin_lock_irqsave(&pci_config_lock, flags); outb(0x00, 0xCFB); outb(0x00, 0xCF8); outb(0x00, 0xCFA); - if (inb(0xCF8) == 0x00 && inb(0xCFA) == 0x00 && - pci_sanity_check(&pci_direct_conf2)) { - works = 1; - } - local_irq_restore(flags); + if (inb(0xCF8) == 0x00 && inb(0xCFA) == 0x00) { + spin_unlock_irqrestore(&pci_config_lock, flags); + + if (pci_sanity_check(&pci_direct_conf2)) + works = 1; + } else + spin_unlock_irqrestore(&pci_config_lock, flags); return works; } Index: linux/arch/i386/pci/pci.h =================================================================== --- linux.orig/arch/i386/pci/pci.h +++ linux/arch/i386/pci/pci.h @@ -78,7 +78,7 @@ struct irq_routing_table { extern unsigned int pcibios_irq_mask; extern int pcibios_scanned; -extern spinlock_t pci_config_lock; +extern raw_spinlock_t pci_config_lock; extern int (*pcibios_enable_irq)(struct pci_dev *dev); extern void (*pcibios_disable_irq)(struct pci_dev *dev); Index: linux/arch/ia64/Kconfig =================================================================== --- linux.orig/arch/ia64/Kconfig +++ linux/arch/ia64/Kconfig @@ -32,6 +32,7 @@ config SWIOTLB config RWSEM_XCHGADD_ALGORITHM bool + depends on !PREEMPT_RT default y config GENERIC_FIND_NEXT_BIT @@ -42,7 +43,11 @@ config GENERIC_CALIBRATE_DELAY bool default y -config TIME_INTERPOLATION +config GENERIC_TIME + bool + default y + +config GENERIC_TIME_VSYSCALL bool default y @@ -249,6 +254,69 @@ config SMP If you don't know what to do here, say N. + +config GENERIC_TIME + bool + default y + +config HIGH_RES_TIMERS + bool "High-Resolution Timers" + help + + POSIX timers are available by default. This option enables + high-resolution POSIX timers. With this option the resolution + is at least 1 microsecond. High resolution is not free. If + enabled this option will add a small overhead each time a + timer expires that is not on a 1/HZ tick boundary. If no such + timers are used the overhead is nil. + + This option enables two additional POSIX CLOCKS, + CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR. Note that this + option does not change the resolution of CLOCK_REALTIME or + CLOCK_MONOTONIC which remain at 1/HZ resolution. + +config HIGH_RES_RESOLUTION + int "High-Resolution-Timer resolution (nanoseconds)" + depends on HIGH_RES_TIMERS + default 1000 + help + + This sets the resolution of timers accessed with + CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR. Too + fine a resolution (small a number) will usually not + be observable due to normal system latencies. For an + 800 MHZ processor about 10,000 is the recommended maximum + (smallest number). If you don't need that sort of resolution, + higher numbers may generate less overhead. + +choice + prompt "Clock source" + depends on HIGH_RES_TIMERS + default HIGH_RES_TIMER_ITC + help + This option allows you to choose the hardware source in charge + of generating high precision interruptions on your system. + On IA-64 these are: + + + ITC Interval Time Counter 1/CPU clock + HPET High Precision Event Timer ~ (XXX:have to check the spec) + + The ITC timer is available on all the ia64 computers because + it is integrated directly into the processor. However it may not + give correct results on MP machines with processors running + at different clock rates. In this case you may want to use + the HPET if available on your machine. + + +config HIGH_RES_TIMER_ITC + bool "Interval Time Counter/ITC" + +config HIGH_RES_TIMER_HPET + bool "High Precision Event Timer/HPET" + +endchoice + config NR_CPUS int "Maximum number of CPUs (2-1024)" range 2 1024 @@ -301,17 +369,15 @@ config FORCE_CPEI_RETARGET This option it useful to enable this feature on older BIOS's as well. You can also enable this by using boot command line option force_cpei=1. -config PREEMPT - bool "Preemptible Kernel" - help - This option reduces the latency of the kernel when reacting to - real-time or interactive events by allowing a low priority process to - be preempted even if it is in kernel mode executing a system call. - This allows applications to run more reliably even when the system is - under load. +source "kernel/Kconfig.preempt" - Say Y here if you are building a kernel for a desktop, embedded - or real-time system. Say N if you are unsure. +config RWSEM_GENERIC_SPINLOCK + bool + depends on PREEMPT_RT + default y + +config PREEMPT + def_bool y if (PREEMPT_RT || PREEMPT_SOFTIRQS || PREEMPT_HARDIRQS || PREEMPT_VOLUNTARY || PREEMPT_DESKTOP) source "mm/Kconfig" Index: linux/arch/ia64/configs/bigsur_defconfig =================================================================== --- linux.orig/arch/ia64/configs/bigsur_defconfig +++ linux/arch/ia64/configs/bigsur_defconfig @@ -85,7 +85,7 @@ CONFIG_MMU=y CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y Index: linux/arch/ia64/configs/gensparse_defconfig =================================================================== --- linux.orig/arch/ia64/configs/gensparse_defconfig +++ linux/arch/ia64/configs/gensparse_defconfig @@ -86,7 +86,7 @@ CONFIG_MMU=y CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y Index: linux/arch/ia64/configs/sim_defconfig =================================================================== --- linux.orig/arch/ia64/configs/sim_defconfig +++ linux/arch/ia64/configs/sim_defconfig @@ -86,7 +86,7 @@ CONFIG_MMU=y CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y Index: linux/arch/ia64/configs/sn2_defconfig =================================================================== --- linux.orig/arch/ia64/configs/sn2_defconfig +++ linux/arch/ia64/configs/sn2_defconfig @@ -93,7 +93,7 @@ CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_FIND_NEXT_BIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_DMI=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y Index: linux/arch/ia64/configs/tiger_defconfig =================================================================== --- linux.orig/arch/ia64/configs/tiger_defconfig +++ linux/arch/ia64/configs/tiger_defconfig @@ -86,7 +86,7 @@ CONFIG_MMU=y CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y Index: linux/arch/ia64/configs/zx1_defconfig =================================================================== --- linux.orig/arch/ia64/configs/zx1_defconfig +++ linux/arch/ia64/configs/zx1_defconfig @@ -84,7 +84,7 @@ CONFIG_MMU=y CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y Index: linux/arch/ia64/defconfig =================================================================== --- linux.orig/arch/ia64/defconfig +++ linux/arch/ia64/defconfig @@ -86,7 +86,7 @@ CONFIG_MMU=y CONFIG_SWIOTLB=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_EFI=y CONFIG_GENERIC_IOMAP=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y Index: linux/arch/ia64/kernel/asm-offsets.c =================================================================== --- linux.orig/arch/ia64/kernel/asm-offsets.c +++ linux/arch/ia64/kernel/asm-offsets.c @@ -7,6 +7,7 @@ #define ASM_OFFSETS_C 1 #include +#include #include #include @@ -254,18 +255,13 @@ void foo(void) offsetof (struct pal_min_state_area_s, pmsa_xip)); BLANK(); +#ifdef CONFIG_TIME_INTERPOLATION /* used by fsys_gettimeofday in arch/ia64/kernel/fsys.S */ - DEFINE(IA64_TIME_INTERPOLATOR_ADDRESS_OFFSET, offsetof (struct time_interpolator, addr)); - DEFINE(IA64_TIME_INTERPOLATOR_SOURCE_OFFSET, offsetof (struct time_interpolator, source)); - DEFINE(IA64_TIME_INTERPOLATOR_SHIFT_OFFSET, offsetof (struct time_interpolator, shift)); - DEFINE(IA64_TIME_INTERPOLATOR_NSEC_OFFSET, offsetof (struct time_interpolator, nsec_per_cyc)); - DEFINE(IA64_TIME_INTERPOLATOR_OFFSET_OFFSET, offsetof (struct time_interpolator, offset)); - DEFINE(IA64_TIME_INTERPOLATOR_LAST_CYCLE_OFFSET, offsetof (struct time_interpolator, last_cycle)); - DEFINE(IA64_TIME_INTERPOLATOR_LAST_COUNTER_OFFSET, offsetof (struct time_interpolator, last_counter)); - DEFINE(IA64_TIME_INTERPOLATOR_JITTER_OFFSET, offsetof (struct time_interpolator, jitter)); - DEFINE(IA64_TIME_INTERPOLATOR_MASK_OFFSET, offsetof (struct time_interpolator, mask)); - DEFINE(IA64_TIME_SOURCE_CPU, TIME_SOURCE_CPU); - DEFINE(IA64_TIME_SOURCE_MMIO64, TIME_SOURCE_MMIO64); - DEFINE(IA64_TIME_SOURCE_MMIO32, TIME_SOURCE_MMIO32); DEFINE(IA64_TIMESPEC_TV_NSEC_OFFSET, offsetof (struct timespec, tv_nsec)); + DEFINE(IA64_CLOCKSOURCE_MASK_OFFSET, offsetof (struct clocksource, mask)); + DEFINE(IA64_CLOCKSOURCE_MULT_OFFSET, offsetof (struct clocksource, mult)); + DEFINE(IA64_CLOCKSOURCE_SHIFT_OFFSET, offsetof (struct clocksource, shift)); + DEFINE(IA64_CLOCKSOURCE_MMIO_PTR_OFFSET, offsetof (struct clocksource, fsys_mmio_ptr)); + DEFINE(IA64_CLOCKSOURCE_CYCLE_LAST_OFFSET, offsetof (struct clocksource, cycle_last)); +#endif } Index: linux/arch/ia64/kernel/cyclone.c =================================================================== --- linux.orig/arch/ia64/kernel/cyclone.c +++ linux/arch/ia64/kernel/cyclone.c @@ -3,6 +3,7 @@ #include #include #include +#include #include /* IBM Summit (EXA) Cyclone counter code*/ @@ -18,13 +19,21 @@ void __init cyclone_setup(void) use_cyclone = 1; } +static void __iomem *cyclone_mc_ptr; -struct time_interpolator cyclone_interpolator = { - .source = TIME_SOURCE_MMIO64, - .shift = 16, - .frequency = CYCLONE_TIMER_FREQ, - .drift = -100, - .mask = (1LL << 40) - 1 +static cycle_t read_cyclone(void) +{ + return (cycle_t)readq((void __iomem *)cyclone_mc_ptr); +} + +static struct clocksource clocksource_cyclone = { + .name = "cyclone", + .rating = 300, + .read = read_cyclone, + .mask = (1LL << 40) - 1, + .mult = 0, /*to be caluclated*/ + .shift = 16, + .is_continuous = 1, }; int __init init_cyclone_clock(void) @@ -101,8 +110,10 @@ int __init init_cyclone_clock(void) } } /* initialize last tick */ - cyclone_interpolator.addr = cyclone_timer; - register_time_interpolator(&cyclone_interpolator); + clocksource_cyclone.fsys_mmio_ptr = cyclone_mc_ptr = cyclone_timer; + clocksource_cyclone.mult = clocksource_hz2mult(CYCLONE_TIMER_FREQ, + clocksource_cyclone.shift); + clocksource_register(&clocksource_cyclone); return 0; } Index: linux/arch/ia64/kernel/entry.S =================================================================== --- linux.orig/arch/ia64/kernel/entry.S +++ linux/arch/ia64/kernel/entry.S @@ -1101,23 +1101,24 @@ skip_rbs_switch: st8 [r2]=r8 st8 [r3]=r10 .work_pending: - tbit.z p6,p0=r31,TIF_NEED_RESCHED // current_thread_info()->need_resched==0? + tbit.nz p6,p0=r31,TIF_NEED_RESCHED // current_thread_info()->need_resched==0? +(p6) br.cond.sptk.few .needresched + tbit.z p6,p0=r31,TIF_NEED_RESCHED_DELAYED // current_thread_info()->need_resched_delayed==0? (p6) br.cond.sptk.few .notify -#ifdef CONFIG_PREEMPT -(pKStk) dep r21=-1,r0,PREEMPT_ACTIVE_BIT,1 + +.needresched: + +(pKStk) br.cond.sptk.many .fromkernel ;; -(pKStk) st4 [r20]=r21 ssm psr.i // enable interrupts -#endif br.call.spnt.many rp=schedule -.ret9: cmp.eq p6,p0=r0,r0 // p6 <- 1 - rsm psr.i // disable interrupts - ;; -#ifdef CONFIG_PREEMPT -(pKStk) adds r20=TI_PRE_COUNT+IA64_TASK_SIZE,r13 +.ret9a: rsm psr.i // disable interrupts ;; -(pKStk) st4 [r20]=r0 // preempt_count() <- 0 -#endif + br.cond.sptk.many .endpreemptdep +.fromkernel: + br.call.spnt.many rp=preempt_schedule_irq +.ret9b: rsm psr.i // disable interrupts +.endpreemptdep: (pLvSys)br.cond.sptk.few .work_pending_syscall_end br.cond.sptk.many .work_processed_kernel // re-check Index: linux/arch/ia64/kernel/fsys.S =================================================================== --- linux.orig/arch/ia64/kernel/fsys.S +++ linux/arch/ia64/kernel/fsys.S @@ -24,6 +24,7 @@ #include "entry.h" +#ifdef CONFIG_TIME_INTERPOLATION /* * See Documentation/ia64/fsys.txt for details on fsyscalls. * @@ -145,13 +146,6 @@ ENTRY(fsys_set_tid_address) FSYS_RETURN END(fsys_set_tid_address) -/* - * Ensure that the time interpolator structure is compatible with the asm code - */ -#if IA64_TIME_INTERPOLATOR_SOURCE_OFFSET !=0 || IA64_TIME_INTERPOLATOR_SHIFT_OFFSET != 2 \ - || IA64_TIME_INTERPOLATOR_JITTER_OFFSET != 3 || IA64_TIME_INTERPOLATOR_NSEC_OFFSET != 4 -#error fsys_gettimeofday incompatible with changes to struct time_interpolator -#endif #define CLOCK_REALTIME 0 #define CLOCK_MONOTONIC 1 #define CLOCK_DIVIDE_BY_1000 0x4000 @@ -177,19 +171,18 @@ ENTRY(fsys_gettimeofday) // r11 = preserved: saved ar.pfs // r12 = preserved: memory stack // r13 = preserved: thread pointer - // r14 = address of mask / mask + // r14 = address of mask / mask value // r15 = preserved: system call number // r16 = preserved: current task pointer // r17 = wall to monotonic use - // r18 = time_interpolator->offset - // r19 = address of wall_to_monotonic - // r20 = pointer to struct time_interpolator / pointer to time_interpolator->address - // r21 = shift factor - // r22 = address of time interpolator->last_counter - // r23 = address of time_interpolator->last_cycle - // r24 = adress of time_interpolator->offset - // r25 = last_cycle value - // r26 = last_counter value + // r19 = address of itc_lastcycle + // r20 = struct clocksource / address of first element + // r21 = shift value + // r22 = address of itc_jitter/ wall_to_monotonic + // r23 = address of shift + // r24 = address mult factor / cycle_last value + // r25 = itc_lastcycle value + // r26 = address clocksource cycle_last // r27 = pointer to xtime // r28 = sequence number at the beginning of critcal section // r29 = address of seqlock @@ -199,9 +192,9 @@ ENTRY(fsys_gettimeofday) // p6,p7 short term use // p8 = timesource ar.itc // p9 = timesource mmio64 - // p10 = timesource mmio32 + // p10 = timesource mmio32 - not used // p11 = timesource not to be handled by asm code - // p12 = memory time source ( = p9 | p10) + // p12 = memory time source ( = p9 | p10) - not used // p13 = do cmpxchg with time_interpolator_last_cycle // p14 = Divide by 1000 // p15 = Add monotonic @@ -212,61 +205,55 @@ ENTRY(fsys_gettimeofday) tnat.nz p6,p0 = r31 // branch deferred since it does not fit into bundle structure mov pr = r30,0xc000 // Set predicates according to function add r2 = TI_FLAGS+IA64_TASK_SIZE,r16 - movl r20 = time_interpolator + movl r20 = fsyscall_clock // load fsyscall clocksource address ;; - ld8 r20 = [r20] // get pointer to time_interpolator structure + add r10 = IA64_CLOCKSOURCE_MMIO_PTR_OFFSET,r20 movl r29 = xtime_lock ld4 r2 = [r2] // process work pending flags movl r27 = xtime ;; // only one bundle here - ld8 r21 = [r20] // first quad with control information + add r14 = IA64_CLOCKSOURCE_MASK_OFFSET,r20 + movl r22 = itc_jitter + add r24 = IA64_CLOCKSOURCE_MULT_OFFSET,r20 and r2 = TIF_ALLWORK_MASK,r2 (p6) br.cond.spnt.few .fail_einval // deferred branch ;; - add r10 = IA64_TIME_INTERPOLATOR_ADDRESS_OFFSET,r20 - extr r3 = r21,32,32 // time_interpolator->nsec_per_cyc - extr r8 = r21,0,16 // time_interpolator->source + ld8 r30 = [r10] // clocksource->mmio_ptr + movl r19 = itc_lastcycle + add r23 = IA64_CLOCKSOURCE_SHIFT_OFFSET,r20 cmp.ne p6, p0 = 0, r2 // Fallback if work is scheduled (p6) br.cond.spnt.many fsys_fallback_syscall ;; - cmp.eq p8,p12 = 0,r8 // Check for cpu timer - cmp.eq p9,p0 = 1,r8 // MMIO64 ? - extr r2 = r21,24,8 // time_interpolator->jitter - cmp.eq p10,p0 = 2,r8 // MMIO32 ? - cmp.ltu p11,p0 = 2,r8 // function or other clock -(p11) br.cond.spnt.many fsys_fallback_syscall - ;; - setf.sig f7 = r3 // Setup for scaling of counter -(p15) movl r19 = wall_to_monotonic -(p12) ld8 r30 = [r10] - cmp.ne p13,p0 = r2,r0 // need jitter compensation? - extr r21 = r21,16,8 // shift factor + ld8 r14 = [r14] // clocksource mask value + ld4 r2 = [r22] // itc_jitter value + add r26 = IA64_CLOCKSOURCE_CYCLE_LAST_OFFSET,r20 // clock fsyscall_cycle_last + ld4 r3 = [r24] // clocksource->mult value + cmp.eq p8,p9 = 0,r30 // Check for cpu timer, no mmio_ptr, set p8, clear p9 + ;; + setf.sig f7 = r3 // Setup for mult scaling of counter +(p15) movl r22 = wall_to_monotonic + ld4 r21 = [r23] // shift value +(p8) cmp.ne p13,p0 = r2,r0 // need jitter compensation, set p13 +(p9) cmp.eq p13,p0 = 0,r30 // if mmio_ptr, clear p13 jitter control ;; .time_redo: .pred.rel.mutex p8,p9,p10 ld4.acq r28 = [r29] // xtime_lock.sequence. Must come first for locking purposes (p8) mov r2 = ar.itc // CPU_TIMER. 36 clocks latency!!! - add r22 = IA64_TIME_INTERPOLATOR_LAST_COUNTER_OFFSET,r20 (p9) ld8 r2 = [r30] // readq(ti->address). Could also have latency issues.. -(p10) ld4 r2 = [r30] // readw(ti->address) -(p13) add r23 = IA64_TIME_INTERPOLATOR_LAST_CYCLE_OFFSET,r20 +(p13) ld8 r25 = [r19] // get itc_lastcycle value ;; // could be removed by moving the last add upward - ld8 r26 = [r22] // time_interpolator->last_counter -(p13) ld8 r25 = [r23] // time interpolator->last_cycle - add r24 = IA64_TIME_INTERPOLATOR_OFFSET_OFFSET,r20 -(p15) ld8 r17 = [r19],IA64_TIMESPEC_TV_NSEC_OFFSET ld8 r9 = [r27],IA64_TIMESPEC_TV_NSEC_OFFSET - add r14 = IA64_TIME_INTERPOLATOR_MASK_OFFSET, r20 + ld8 r24 = [r26] // get fsyscall_cycle_last value +(p15) ld8 r17 = [r22],IA64_TIMESPEC_TV_NSEC_OFFSET ;; - ld8 r18 = [r24] // time_interpolator->offset ld8 r8 = [r27],-IA64_TIMESPEC_TV_NSEC_OFFSET // xtime.tv_nsec -(p13) sub r3 = r25,r2 // Diff needed before comparison (thanks davidm) +(p13) sub r3 = r25,r2 // Diff needed before comparison (thanks davidm) ;; - ld8 r14 = [r14] // time_interpolator->mask -(p13) cmp.gt.unc p6,p7 = r3,r0 // check if it is less than last. p6,p7 cleared - sub r10 = r2,r26 // current_counter - last_counter +(p13) cmp.gt.unc p6,p7 = r3,r0 // check if it is less than last. p6,p7 cleared + sub r10 = r2,r24 // current_counter - last_counter ;; -(p6) sub r10 = r25,r26 // time we got was less than last_cycle +(p6) sub r10 = r25,r24 // time we got was less than last_cycle (p7) mov ar.ccv = r25 // more than last_cycle. Prep for cmpxchg ;; and r10 = r10,r14 // Apply mask @@ -274,22 +261,21 @@ ENTRY(fsys_gettimeofday) setf.sig f8 = r10 nop.i 123 ;; -(p7) cmpxchg8.rel r3 = [r23],r2,ar.ccv +(p7) cmpxchg8.rel r3 = [r19],r2,ar.ccv EX(.fail_efault, probe.w.fault r31, 3) // This takes 5 cycles and we have spare time xmpy.l f8 = f8,f7 // nsec_per_cyc*(counter-last_counter) (p15) add r9 = r9,r17 // Add wall to monotonic.secs to result secs ;; -(p15) ld8 r17 = [r19],-IA64_TIMESPEC_TV_NSEC_OFFSET +(p15) ld8 r17 = [r22],-IA64_TIMESPEC_TV_NSEC_OFFSET (p7) cmp.ne p7,p0 = r25,r3 // if cmpxchg not successful redo // simulate tbit.nz.or p7,p0 = r28,0 and r28 = ~1,r28 // Make sequence even to force retry if odd getf.sig r2 = f8 mf - add r8 = r8,r18 // Add time interpolator offset ;; ld4 r10 = [r29] // xtime_lock.sequence (p15) add r8 = r8, r17 // Add monotonic.nsecs to nsecs - shr.u r2 = r2,r21 + shr.u r2 = r2,r21 // shift by factor ;; // overloaded 3 bundles! // End critical section. add r8 = r8,r2 // Add xtime.nsecs @@ -348,6 +334,26 @@ ENTRY(fsys_clock_gettime) br.many .gettime END(fsys_clock_gettime) + +#else // !CONFIG_TIME_INTERPOLATION + +# define fsys_gettimeofday 0 +# define fsys_clock_gettime 0 + +.fail_einval: + mov r8 = EINVAL + mov r10 = -1 + FSYS_RETURN + +.fail_efault: + mov r8 = EFAULT + mov r10 = -1 + FSYS_RETURN + +#endif + + + /* * long fsys_rt_sigprocmask (int how, sigset_t *set, sigset_t *oset, size_t sigsetsize). */ Index: linux/arch/ia64/kernel/iosapic.c =================================================================== --- linux.orig/arch/ia64/kernel/iosapic.c +++ linux/arch/ia64/kernel/iosapic.c @@ -112,7 +112,7 @@ (PAGE_SIZE / sizeof(struct iosapic_rte_info)) #define RTE_PREALLOCATED (1) -static DEFINE_SPINLOCK(iosapic_lock); +static DEFINE_RAW_SPINLOCK(iosapic_lock); /* * These tables map IA-64 vectors to the IOSAPIC pin that generates this @@ -409,6 +409,34 @@ iosapic_startup_level_irq (unsigned int return 0; } +/* + * In the preemptible case mask the IRQ first then handle it and ack it. + */ +#ifdef CONFIG_PREEMPT_HARDIRQS + +static void +iosapic_ack_level_irq (unsigned int irq) +{ + ia64_vector vec = irq_to_vector(irq); + struct iosapic_rte_info *rte; + + move_irq(irq); + mask_irq(irq); + list_for_each_entry(rte, &iosapic_intr_info[vec].rtes, rte_list) + iosapic_eoi(rte->addr, vec); +} + +static void +iosapic_end_level_irq (unsigned int irq) +{ + if (!(irq_desc[irq].status & IRQ_INPROGRESS)) + unmask_irq(irq); +} + +#else /* !CONFIG_PREEMPT_HARDIRQS */ + +#define iosapic_ack_level_irq nop + static void iosapic_end_level_irq (unsigned int irq) { @@ -420,10 +448,12 @@ iosapic_end_level_irq (unsigned int irq) iosapic_eoi(rte->addr, vec); } + +#endif + #define iosapic_shutdown_level_irq mask_irq #define iosapic_enable_level_irq unmask_irq #define iosapic_disable_level_irq mask_irq -#define iosapic_ack_level_irq nop struct hw_interrupt_type irq_type_iosapic_level = { .name = "IO-SAPIC-level", Index: linux/arch/ia64/kernel/mca.c =================================================================== --- linux.orig/arch/ia64/kernel/mca.c +++ linux/arch/ia64/kernel/mca.c @@ -319,7 +319,7 @@ ia64_mca_spin(const char *func) typedef struct ia64_state_log_s { - spinlock_t isl_lock; + raw_spinlock_t isl_lock; int isl_index; unsigned long isl_count; ia64_err_rec_t *isl_log[IA64_MAX_LOGS]; /* need space to store header + error log */ Index: linux/arch/ia64/kernel/palinfo.c =================================================================== --- linux.orig/arch/ia64/kernel/palinfo.c +++ linux/arch/ia64/kernel/palinfo.c @@ -952,7 +952,6 @@ remove_palinfo_proc_entries(unsigned int } } -#ifdef CONFIG_HOTPLUG_CPU static int palinfo_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { @@ -974,7 +973,6 @@ static struct notifier_block palinfo_cpu .notifier_call = palinfo_cpu_callback, .priority = 0, }; -#endif static int __init palinfo_init(void) Index: linux/arch/ia64/kernel/perfmon.c =================================================================== --- linux.orig/arch/ia64/kernel/perfmon.c +++ linux/arch/ia64/kernel/perfmon.c @@ -281,7 +281,7 @@ typedef struct { */ typedef struct pfm_context { - spinlock_t ctx_lock; /* context protection */ + raw_spinlock_t ctx_lock; /* context protection */ pfm_context_flags_t ctx_flags; /* bitmask of flags (block reason incl.) */ unsigned int ctx_state; /* state: active/inactive (no bitfield) */ @@ -370,7 +370,7 @@ typedef struct pfm_context { * mostly used to synchronize between system wide and per-process */ typedef struct { - spinlock_t pfs_lock; /* lock the structure */ + raw_spinlock_t pfs_lock; /* lock the structure */ unsigned int pfs_task_sessions; /* number of per task sessions */ unsigned int pfs_sys_sessions; /* number of per system wide sessions */ @@ -511,7 +511,7 @@ static pfm_intr_handler_desc_t *pfm_alt static struct proc_dir_entry *perfmon_dir; static pfm_uuid_t pfm_null_uuid = {0,}; -static spinlock_t pfm_buffer_fmt_lock; +static raw_spinlock_t pfm_buffer_fmt_lock; static LIST_HEAD(pfm_buffer_fmt_list); static pmu_config_t *pmu_conf; Index: linux/arch/ia64/kernel/process.c =================================================================== --- linux.orig/arch/ia64/kernel/process.c +++ linux/arch/ia64/kernel/process.c @@ -94,6 +94,9 @@ show_stack (struct task_struct *task, un void dump_stack (void) { + if (irqs_disabled()) { + printk("Uh oh.. entering dump_stack() with irqs disabled.\n"); + } show_stack(NULL, NULL); } @@ -197,7 +200,7 @@ void default_idle (void) { local_irq_enable(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { if (can_do_pal_halt) safe_halt(); else @@ -273,7 +276,7 @@ cpu_idle (void) else current_thread_info()->status |= TS_POLLING; - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { void (*idle)(void); #ifdef CONFIG_SMP min_xtp(); @@ -295,10 +298,11 @@ cpu_idle (void) normal_xtp(); #endif } - preempt_enable_no_resched(); - schedule(); + __preempt_enable_no_resched(); + __schedule(); + preempt_disable(); - check_pgt_cache(); + if (cpu_is_offline(cpu)) play_dead(); } Index: linux/arch/ia64/kernel/sal.c =================================================================== --- linux.orig/arch/ia64/kernel/sal.c +++ linux/arch/ia64/kernel/sal.c @@ -18,7 +18,7 @@ #include #include - __cacheline_aligned DEFINE_SPINLOCK(sal_lock); + __cacheline_aligned DEFINE_RAW_SPINLOCK(sal_lock); unsigned long sal_platform_features; unsigned short sal_revision; Index: linux/arch/ia64/kernel/salinfo.c =================================================================== --- linux.orig/arch/ia64/kernel/salinfo.c +++ linux/arch/ia64/kernel/salinfo.c @@ -141,7 +141,7 @@ enum salinfo_state { struct salinfo_data { cpumask_t cpu_event; /* which cpus have outstanding events */ - struct semaphore mutex; + struct compat_semaphore mutex; u8 *log_buffer; u64 log_size; u8 *oemdata; /* decoded oem data */ @@ -157,8 +157,8 @@ struct salinfo_data { static struct salinfo_data salinfo_data[ARRAY_SIZE(salinfo_log_name)]; -static DEFINE_SPINLOCK(data_lock); -static DEFINE_SPINLOCK(data_saved_lock); +static DEFINE_RAW_SPINLOCK(data_lock); +static DEFINE_RAW_SPINLOCK(data_saved_lock); /** salinfo_platform_oemdata - optional callback to decode oemdata from an error * record. @@ -575,7 +575,6 @@ static struct file_operations salinfo_da .write = salinfo_log_write, }; -#ifdef CONFIG_HOTPLUG_CPU static int __devinit salinfo_cpu_callback(struct notifier_block *nb, unsigned long action, void *hcpu) { @@ -620,7 +619,6 @@ static struct notifier_block salinfo_cpu .notifier_call = salinfo_cpu_callback, .priority = 0, }; -#endif /* CONFIG_HOTPLUG_CPU */ static int __init salinfo_init(void) Index: linux/arch/ia64/kernel/semaphore.c =================================================================== --- linux.orig/arch/ia64/kernel/semaphore.c +++ linux/arch/ia64/kernel/semaphore.c @@ -40,12 +40,12 @@ */ void -__up (struct semaphore *sem) +__up (struct compat_semaphore *sem) { wake_up(&sem->wait); } -void __sched __down (struct semaphore *sem) +void __sched __down (struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -82,7 +82,7 @@ void __sched __down (struct semaphore *s tsk->state = TASK_RUNNING; } -int __sched __down_interruptible (struct semaphore * sem) +int __sched __down_interruptible (struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -142,7 +142,7 @@ int __sched __down_interruptible (struct * count. */ int -__down_trylock (struct semaphore *sem) +__down_trylock (struct compat_semaphore *sem) { unsigned long flags; int sleepers; Index: linux/arch/ia64/kernel/signal.c =================================================================== --- linux.orig/arch/ia64/kernel/signal.c +++ linux/arch/ia64/kernel/signal.c @@ -487,6 +487,14 @@ ia64_do_signal (sigset_t *oldset, struct long errno = scr->pt.r8; # define ERR_CODE(c) (IS_IA32_PROCESS(&scr->pt) ? -(c) : (c)) +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif + /* * In the ia64_leave_kernel code path, we want the common case to go fast, which * is why we may in certain cases get here from kernel mode. Just return without Index: linux/arch/ia64/kernel/smp.c =================================================================== --- linux.orig/arch/ia64/kernel/smp.c +++ linux/arch/ia64/kernel/smp.c @@ -222,6 +222,22 @@ smp_send_reschedule (int cpu) platform_send_ipi(cpu, IA64_IPI_RESCHEDULE, IA64_IPI_DM_INT, 0); } +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + unsigned int cpu; + + for_each_online_cpu(cpu) { + if (cpu != smp_processor_id()) + platform_send_ipi(cpu, IA64_IPI_RESCHEDULE, IA64_IPI_DM_INT, 0); + } +} + + void smp_flush_tlb_all (void) { Index: linux/arch/ia64/kernel/smpboot.c =================================================================== --- linux.orig/arch/ia64/kernel/smpboot.c +++ linux/arch/ia64/kernel/smpboot.c @@ -371,6 +371,8 @@ smp_setup_percpu_timer (void) { } +extern void register_itc_clockevent(void); + static void __devinit smp_callin (void) { @@ -430,6 +432,7 @@ smp_callin (void) #ifdef CONFIG_IA32_SUPPORT ia32_gdt_init(); #endif + register_itc_clockevent(); /* * Allow the master to continue. Index: linux/arch/ia64/kernel/time.c =================================================================== --- linux.orig/arch/ia64/kernel/time.c +++ linux/arch/ia64/kernel/time.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -29,6 +30,10 @@ #include #include +static cycle_t itc_get_cycles(void); +cycle_t itc_lastcycle __attribute__((aligned(L1_CACHE_BYTES))); +int itc_jitter __attribute__((aligned(L1_CACHE_BYTES))); + volatile int time_keeper_id = 0; /* smp_processor_id() of time-keeper */ #ifdef CONFIG_IA64_DEBUG_IRQ @@ -38,11 +43,16 @@ EXPORT_SYMBOL(last_cli_ip); #endif -static struct time_interpolator itc_interpolator = { - .shift = 16, - .mask = 0xffffffffffffffffLL, - .source = TIME_SOURCE_CPU +static struct clocksource clocksource_itc = { + .name = "itc", + .rating = 350, + .read = itc_get_cycles, + .mask = 0xffffffffffffffffLL, + .mult = 0, /*to be caluclated*/ + .shift = 16, + .is_continuous = 1, }; +static struct clocksource *clocksource_itc_p; static irqreturn_t timer_interrupt (int irq, void *dev_id) @@ -55,6 +65,7 @@ timer_interrupt (int irq, void *dev_id) platform_timer_interrupt(irq, dev_id); +#if 0 new_itm = local_cpu_data->itm_next; if (!time_after(ia64_get_itc(), new_itm)) @@ -62,29 +73,48 @@ timer_interrupt (int irq, void *dev_id) ia64_get_itc(), new_itm); profile_tick(CPU_PROFILING); +#endif + + if (time_after(ia64_get_itc(), local_cpu_data->itm_tick_next)) { - while (1) { - update_process_times(user_mode(get_irq_regs())); + unsigned long new_tick_itm; + new_tick_itm = local_cpu_data->itm_tick_next; - new_itm += local_cpu_data->itm_delta; + profile_tick(CPU_PROFILING, get_irq_regs()); + + while (1) { + update_process_times(user_mode(get_irq_regs())); + + new_tick_itm += local_cpu_data->itm_tick_delta; + + if (smp_processor_id() == time_keeper_id) { + /* + * Here we are in the timer irq handler. We have irqs locally + * disabled, but we don't know if the timer_bh is running on + * another CPU. We need to avoid to SMP race by acquiring the + * xtime_lock. + */ + write_seqlock(&xtime_lock); + do_timer(get_irq_regs()); + local_cpu_data->itm_tick_next = new_tick_itm; + write_sequnlock(&xtime_lock); + } else + local_cpu_data->itm_tick_next = new_tick_itm; + + if (time_after(new_tick_itm, ia64_get_itc())) + break; + } + } - if (smp_processor_id() == time_keeper_id) { - /* - * Here we are in the timer irq handler. We have irqs locally - * disabled, but we don't know if the timer_bh is running on - * another CPU. We need to avoid to SMP race by acquiring the - * xtime_lock. - */ - write_seqlock(&xtime_lock); - do_timer(1); - local_cpu_data->itm_next = new_itm; - write_sequnlock(&xtime_lock); - } else - local_cpu_data->itm_next = new_itm; + if (time_after(ia64_get_itc(), local_cpu_data->itm_timer_next)) { + if (itc_clockevent.event_handler) + itc_clockevent.event_handler(get_irq_regs()); - if (time_after(new_itm, ia64_get_itc())) - break; + // FIXME, really, please + new_itm = local_cpu_data->itm_tick_next; + if (time_after(new_itm, local_cpu_data->itm_timer_next)) + new_itm = local_cpu_data->itm_timer_next; /* * Allow IPIs to interrupt the timer loop. */ @@ -102,8 +132,8 @@ timer_interrupt (int irq, void *dev_id) * too fast (with the potentially devastating effect * of losing monotony of time). */ - while (!time_after(new_itm, ia64_get_itc() + local_cpu_data->itm_delta/2)) - new_itm += local_cpu_data->itm_delta; + while (!time_after(new_itm, ia64_get_itc() + local_cpu_data->itm_tick_delta/2)) + new_itm += local_cpu_data->itm_tick_delta; ia64_set_itm(new_itm); /* double check, in case we got hit by a (slow) PMI: */ } while (time_after_eq(ia64_get_itc(), new_itm)); @@ -122,7 +152,7 @@ ia64_cpu_local_tick (void) /* arrange for the cycle counter to generate a timer interrupt: */ ia64_set_itv(IA64_TIMER_VECTOR); - delta = local_cpu_data->itm_delta; + delta = local_cpu_data->itm_tick_delta; /* * Stagger the timer tick for each CPU so they don't occur all at (almost) the * same time: @@ -131,8 +161,8 @@ ia64_cpu_local_tick (void) unsigned long hi = 1UL << ia64_fls(cpu); shift = (2*(cpu - hi) + 1) * delta/hi/2; } - local_cpu_data->itm_next = ia64_get_itc() + delta + shift; - ia64_set_itm(local_cpu_data->itm_next); + local_cpu_data->itm_tick_next = ia64_get_itc() + delta + shift; + ia64_set_itm(local_cpu_data->itm_tick_next); } static int nojitter; @@ -190,7 +220,7 @@ ia64_init_itm (void) itc_freq = (platform_base_freq*itc_ratio.num)/itc_ratio.den; - local_cpu_data->itm_delta = (itc_freq + HZ/2) / HZ; + local_cpu_data->itm_tick_delta = (itc_freq + HZ/2) / HZ; printk(KERN_DEBUG "CPU %d: base freq=%lu.%03luMHz, ITC ratio=%u/%u, " "ITC freq=%lu.%03luMHz", smp_processor_id(), platform_base_freq / 1000000, (platform_base_freq / 1000) % 1000, @@ -210,9 +240,8 @@ ia64_init_itm (void) local_cpu_data->nsec_per_cyc = ((NSEC_PER_SEC<itc_freq; - itc_interpolator.drift = itc_drift; #ifdef CONFIG_SMP /* On IA64 in an SMP configuration ITCs are never accurately synchronized. * Jitter compensation requires a cmpxchg which may limit @@ -224,18 +253,57 @@ ia64_init_itm (void) * even going backward) if the ITC offsets between the individual CPUs * are too large. */ - if (!nojitter) itc_interpolator.jitter = 1; + if (!nojitter) itc_jitter = 1; #endif - register_time_interpolator(&itc_interpolator); } +#endif /* Setup the CPU local timer tick */ ia64_cpu_local_tick(); + + if (!clocksource_itc_p) { + /* Sort out mult/shift values: */ + clocksource_itc.mult = clocksource_hz2mult(local_cpu_data->itc_freq, + clocksource_itc.shift); + clocksource_register(&clocksource_itc); + clocksource_itc_p = &clocksource_itc; + } } + +static cycle_t itc_get_cycles() +{ + if (itc_jitter) { + u64 lcycle; + u64 now; + + do { + lcycle = itc_lastcycle; + now = get_cycles(); + if (lcycle && time_after(lcycle, now)) + return lcycle; + + /* When holding the xtime write lock, there's no need + * to add the overhead of the cmpxchg. Readers are + * force to retry until the write lock is released. + */ + if (spin_is_locked(&xtime_lock.lock)) { + itc_lastcycle = now; + return now; + } + /* Keep track of the last timer value returned. The use of cmpxchg here + * will cause contention in an SMP environment. + */ + } while (unlikely(cmpxchg(&itc_lastcycle, lcycle, now) != lcycle)); + return now; + } else + return get_cycles(); +} + + static struct irqaction timer_irqaction = { .handler = timer_interrupt, - .flags = IRQF_DISABLED, + .flags = IRQF_DISABLED | IRQF_NODELAY, .name = "timer" }; @@ -256,6 +324,8 @@ time_init (void) * tv_nsec field must be normalized (i.e., 0 <= nsec < NSEC_PER_SEC). */ set_normalized_timespec(&wall_to_monotonic, -xtime.tv_sec, -xtime.tv_nsec); + register_itc_clocksource(); + register_itc_clockevent(); } /* @@ -308,3 +378,10 @@ ia64_setup_printk_clock(void) if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT)) ia64_printk_clock = ia64_itc_printk_clock; } + +struct clocksource fsyscall_clock __attribute__((aligned(L1_CACHE_BYTES))); + +void update_vsyscall(struct timespec *wall, struct clocksource *c) +{ + fsyscall_clock = *c; +} Index: linux/arch/ia64/kernel/traps.c =================================================================== --- linux.orig/arch/ia64/kernel/traps.c +++ linux/arch/ia64/kernel/traps.c @@ -24,7 +24,7 @@ #include #include -extern spinlock_t timerlist_lock; +extern raw_spinlock_t timerlist_lock; fpswa_interface_t *fpswa_interface; EXPORT_SYMBOL(fpswa_interface); @@ -85,11 +85,11 @@ void die (const char *str, struct pt_regs *regs, long err) { static struct { - spinlock_t lock; + raw_spinlock_t lock; u32 lock_owner; int lock_owner_depth; } die = { - .lock = SPIN_LOCK_UNLOCKED, + .lock = RAW_SPIN_LOCK_UNLOCKED, .lock_owner = -1, .lock_owner_depth = 0 }; @@ -226,7 +226,7 @@ __kprobes ia64_bad_break (unsigned long * access to fph by the time we get here, as the IVT's "Disabled FP-Register" handler takes * care of clearing psr.dfh. */ -static inline void +void disabled_fph_fault (struct pt_regs *regs) { struct ia64_psr *psr = ia64_psr(regs); @@ -245,7 +245,7 @@ disabled_fph_fault (struct pt_regs *regs = (struct task_struct *)ia64_get_kr(IA64_KR_FPU_OWNER); if (ia64_is_local_fpu_owner(current)) { - preempt_enable_no_resched(); + __preempt_enable_no_resched(); return; } @@ -265,7 +265,7 @@ disabled_fph_fault (struct pt_regs *regs */ psr->mfh = 1; } - preempt_enable_no_resched(); + __preempt_enable_no_resched(); } static inline int Index: linux/arch/ia64/kernel/unwind.c =================================================================== --- linux.orig/arch/ia64/kernel/unwind.c +++ linux/arch/ia64/kernel/unwind.c @@ -81,7 +81,7 @@ typedef unsigned long unw_word; typedef unsigned char unw_hash_index_t; static struct { - spinlock_t lock; /* spinlock for unwind data */ + raw_spinlock_t lock; /* spinlock for unwind data */ /* list of unwind tables (one per load-module) */ struct unw_table *tables; @@ -145,7 +145,7 @@ static struct { # endif } unw = { .tables = &unw.kernel_table, - .lock = SPIN_LOCK_UNLOCKED, + .lock = RAW_SPIN_LOCK_UNLOCKED, .save_order = { UNW_REG_RP, UNW_REG_PFS, UNW_REG_PSP, UNW_REG_PR, UNW_REG_UNAT, UNW_REG_LC, UNW_REG_FPSR, UNW_REG_PRI_UNAT_GR Index: linux/arch/ia64/kernel/unwind_i.h =================================================================== --- linux.orig/arch/ia64/kernel/unwind_i.h +++ linux/arch/ia64/kernel/unwind_i.h @@ -154,7 +154,7 @@ struct unw_script { unsigned long ip; /* ip this script is for */ unsigned long pr_mask; /* mask of predicates script depends on */ unsigned long pr_val; /* predicate values this script is for */ - rwlock_t lock; + raw_rwlock_t lock; unsigned int flags; /* see UNW_FLAG_* in unwind.h */ unsigned short lru_chain; /* used for least-recently-used chain */ unsigned short coll_chain; /* used for hash collisions */ Index: linux/arch/ia64/mm/init.c =================================================================== --- linux.orig/arch/ia64/mm/init.c +++ linux/arch/ia64/mm/init.c @@ -36,7 +36,7 @@ #include #include -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist); DEFINE_PER_CPU(long, __pgtable_quicklist_size); @@ -92,15 +92,11 @@ check_pgt_cache(void) if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES)) return; - preempt_disable(); while (unlikely((pages_to_free = min_pages_to_free()) > 0)) { while (pages_to_free--) { free_page((unsigned long)pgtable_quicklist_alloc()); } - preempt_enable(); - preempt_disable(); } - preempt_enable(); } void Index: linux/arch/ia64/mm/tlb.c =================================================================== --- linux.orig/arch/ia64/mm/tlb.c +++ linux/arch/ia64/mm/tlb.c @@ -32,7 +32,7 @@ static struct { } purge; struct ia64_ctx ia64_ctx = { - .lock = SPIN_LOCK_UNLOCKED, + .lock = RAW_SPIN_LOCK_UNLOCKED, .next = 1, .max_ctx = ~0U }; Index: linux/arch/ia64/sn/kernel/sn2/timer.c =================================================================== --- linux.orig/arch/ia64/sn/kernel/sn2/timer.c +++ linux/arch/ia64/sn/kernel/sn2/timer.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -22,11 +23,21 @@ extern unsigned long sn_rtc_cycles_per_second; -static struct time_interpolator sn2_interpolator = { - .drift = -1, - .shift = 10, - .mask = (1LL << 55) - 1, - .source = TIME_SOURCE_MMIO64 +static void __iomem *sn2_mc_ptr; + +static cycle_t read_sn2(void) +{ + return (cycle_t)readq(sn2_mc_ptr); +} + +static struct clocksource clocksource_sn2 = { + .name = "sn2_rtc", + .rating = 300, + .read = read_sn2, + .mask = (1LL << 55) - 1, + .mult = 0, + .shift = 10, + .is_continuous = 1, }; /* @@ -47,9 +58,10 @@ ia64_sn_udelay (unsigned long usecs) void __init sn_timer_init(void) { - sn2_interpolator.frequency = sn_rtc_cycles_per_second; - sn2_interpolator.addr = RTC_COUNTER_ADDR; - register_time_interpolator(&sn2_interpolator); + clocksource_sn2.fsys_mmio_ptr = sn2_mc_ptr = RTC_COUNTER_ADDR; + clocksource_sn2.mult = clocksource_hz2mult(sn_rtc_cycles_per_second, + clocksource_sn2.shift); + clocksource_register(&clocksource_sn2); ia64_udelay = &ia64_sn_udelay; } Index: linux/arch/mips/Kconfig =================================================================== --- linux.orig/arch/mips/Kconfig +++ linux/arch/mips/Kconfig @@ -364,6 +364,7 @@ config MOMENCO_JAGUAR_ATX config MOMENCO_OCELOT bool "Momentum Ocelot board" select DMA_NONCOHERENT + select NO_SPINLOCK select HW_HAS_PCI select IRQ_CPU select IRQ_CPU_RM7K @@ -782,6 +783,7 @@ source "arch/mips/cobalt/Kconfig" endmenu + config RWSEM_GENERIC_SPINLOCK bool default y @@ -789,6 +791,10 @@ config RWSEM_GENERIC_SPINLOCK config RWSEM_XCHGADD_ALGORITHM bool +config ASM_SEMAPHORES + bool + default y + config GENERIC_FIND_NEXT_BIT bool default y @@ -838,6 +844,9 @@ config DMA_NEED_PCI_MAP_STATE config OWN_DMA bool +config NO_SPINLOCK + bool + config EARLY_PRINTK bool @@ -1784,12 +1793,17 @@ config MIPS_INSANE_LARGE This will result in additional memory usage, so it is not recommended for normal users. -endmenu - -config RWSEM_GENERIC_SPINLOCK +config GENERIC_TIME bool default y +source "kernel/time/Kconfig" + +config CPU_SPEED + int "CPU speed used for clocksource/clockevent calculations" + default 600 +endmenu + config LOCKDEP_SUPPORT bool default y Index: linux/arch/mips/kernel/Makefile =================================================================== --- linux.orig/arch/mips/kernel/Makefile +++ linux/arch/mips/kernel/Makefile @@ -5,7 +5,7 @@ extra-y := head.o init_task.o vmlinux.lds obj-y += cpu-probe.o branch.o entry.o genex.o irq.o process.o \ - ptrace.o reset.o semaphore.o setup.o signal.o syscall.o \ + ptrace.o reset.o setup.o signal.o syscall.o \ time.o topology.o traps.o unaligned.o binfmt_irix-objs := irixelf.o irixinv.o irixioctl.o irixsig.o \ @@ -16,6 +16,8 @@ obj-$(CONFIG_MODULES) += mips_ksyms.o m obj-$(CONFIG_APM) += apm.o +obj-$(CONFIG_ASM_SEMAPHORES) += semaphore.o + obj-$(CONFIG_CPU_R3000) += r2300_fpu.o r2300_switch.o obj-$(CONFIG_CPU_TX39XX) += r2300_fpu.o r2300_switch.o obj-$(CONFIG_CPU_TX49XX) += r4k_fpu.o r4k_switch.o Index: linux/arch/mips/kernel/asm-offsets.c =================================================================== --- linux.orig/arch/mips/kernel/asm-offsets.c +++ linux/arch/mips/kernel/asm-offsets.c @@ -10,9 +10,11 @@ */ #include #include +#include #include #include #include +#include #include #include Index: linux/arch/mips/kernel/entry.S =================================================================== --- linux.orig/arch/mips/kernel/entry.S +++ linux/arch/mips/kernel/entry.S @@ -22,7 +22,7 @@ #ifndef CONFIG_PREEMPT .macro preempt_stop - local_irq_disable + raw_local_irq_disable .endm #define resume_kernel restore_all #endif @@ -44,7 +44,7 @@ FEXPORT(_ret_from_irq) beqz t0, resume_kernel resume_userspace: - local_irq_disable # make sure we dont miss an + raw_local_irq_disable # make sure we dont miss an # interrupt setting need_resched # between sampling and return LONG_L a2, TI_FLAGS($28) # current->work @@ -54,7 +54,9 @@ resume_userspace: #ifdef CONFIG_PREEMPT resume_kernel: - local_irq_disable + raw_local_irq_disable + lw t0, kernel_preemption + beqz t0, restore_all lw t0, TI_PRE_COUNT($28) bnez t0, restore_all need_resched: @@ -64,7 +66,9 @@ need_resched: LONG_L t0, PT_STATUS(sp) # Interrupts off? andi t0, 1 beqz t0, restore_all + raw_local_irq_disable jal preempt_schedule_irq + sw zero, TI_PRE_COUNT($28) b need_resched #endif @@ -72,7 +76,7 @@ FEXPORT(ret_from_fork) jal schedule_tail # a0 = struct task_struct *prev FEXPORT(syscall_exit) - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) # current->work @@ -139,19 +143,21 @@ FEXPORT(restore_partial) # restore part .set at work_pending: - andi t0, a2, _TIF_NEED_RESCHED # a2 is preloaded with TI_FLAGS + # a2 is preloaded with TI_FLAGS + andi t0, a2, (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beqz t0, work_notifysig work_resched: + raw_local_irq_enable t0 jal schedule - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) andi t0, a2, _TIF_WORK_MASK # is there any work to be done # other than syscall tracing? beqz t0, restore_all - andi t0, a2, _TIF_NEED_RESCHED + andi t0, a2, (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bnez t0, work_resched work_notifysig: # deal with pending signals and @@ -167,7 +173,7 @@ syscall_exit_work: li t0, _TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT and t0, a2 # a2 is preloaded with TI_FLAGS beqz t0, work_pending # trace bit set? - local_irq_enable # could let do_syscall_trace() + raw_local_irq_enable # could let do_syscall_trace() # call schedule() instead move a0, sp li a1, 1 Index: linux/arch/mips/kernel/i8259.c =================================================================== --- linux.orig/arch/mips/kernel/i8259.c +++ linux/arch/mips/kernel/i8259.c @@ -31,7 +31,7 @@ void disable_8259A_irq(unsigned int irq) * moves to arch independent land */ -DEFINE_SPINLOCK(i8259A_lock); +DEFINE_RAW_SPINLOCK(i8259A_lock); static void end_8259A_irq (unsigned int irq) { Index: linux/arch/mips/kernel/irq.c =================================================================== --- linux.orig/arch/mips/kernel/irq.c +++ linux/arch/mips/kernel/irq.c @@ -179,7 +179,10 @@ void __init init_IRQ(void) irq_desc[i].action = NULL; irq_desc[i].depth = 1; irq_desc[i].chip = &no_irq_chip; - spin_lock_init(&irq_desc[i].lock); + _raw_spin_lock_init(&irq_desc[i].lock); +#ifdef CONFIG_PREEMPT_HARDIRQS + irq_desc[i].thread = NULL; +#endif #ifdef CONFIG_MIPS_MT_SMTC irq_hwmask[i] = 0; #endif /* CONFIG_MIPS_MT_SMTC */ Index: linux/arch/mips/kernel/module.c =================================================================== --- linux.orig/arch/mips/kernel/module.c +++ linux/arch/mips/kernel/module.c @@ -39,7 +39,7 @@ struct mips_hi16 { static struct mips_hi16 *mips_hi16_list; static LIST_HEAD(dbe_list); -static DEFINE_SPINLOCK(dbe_lock); +static DEFINE_RAW_SPINLOCK(dbe_lock); void *module_alloc(unsigned long size) { Index: linux/arch/mips/kernel/process.c =================================================================== --- linux.orig/arch/mips/kernel/process.c +++ linux/arch/mips/kernel/process.c @@ -55,16 +55,18 @@ ATTRIB_NORET void cpu_idle(void) { /* endless idle loop with no priority at all */ while (1) { - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { #ifdef CONFIG_MIPS_MT_SMTC smtc_idle_loop_hook(); #endif /* CONFIG_MIPS_MT_SMTC */ if (cpu_wait) (*cpu_wait)(); } - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } Index: linux/arch/mips/kernel/scall32-o32.S =================================================================== --- linux.orig/arch/mips/kernel/scall32-o32.S +++ linux/arch/mips/kernel/scall32-o32.S @@ -73,7 +73,7 @@ stack_done: 1: sw v0, PT_R2(sp) # result o32_syscall_exit: - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return lw a2, TI_FLAGS($28) # current->work Index: linux/arch/mips/kernel/scall64-64.S =================================================================== --- linux.orig/arch/mips/kernel/scall64-64.S +++ linux/arch/mips/kernel/scall64-64.S @@ -72,7 +72,7 @@ NESTED(handle_sys64, PT_SIZE, sp) 1: sd v0, PT_R2(sp) # result n64_syscall_exit: - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) # current->work Index: linux/arch/mips/kernel/scall64-n32.S =================================================================== --- linux.orig/arch/mips/kernel/scall64-n32.S +++ linux/arch/mips/kernel/scall64-n32.S @@ -69,7 +69,7 @@ NESTED(handle_sysn32, PT_SIZE, sp) sd v0, PT_R0(sp) # set flag for syscall restarting 1: sd v0, PT_R2(sp) # result - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) # current->work Index: linux/arch/mips/kernel/scall64-o32.S =================================================================== --- linux.orig/arch/mips/kernel/scall64-o32.S +++ linux/arch/mips/kernel/scall64-o32.S @@ -98,7 +98,7 @@ NESTED(handle_sys, PT_SIZE, sp) 1: sd v0, PT_R2(sp) # result o32_syscall_exit: - local_irq_disable # make need_resched and + raw_local_irq_disable # make need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) Index: linux/arch/mips/kernel/semaphore.c =================================================================== --- linux.orig/arch/mips/kernel/semaphore.c +++ linux/arch/mips/kernel/semaphore.c @@ -36,7 +36,7 @@ * sem->count and sem->waking atomic. Scalability isn't an issue because * this lock is used on UP only so it's just an empty variable. */ -static inline int __sem_update_count(struct semaphore *sem, int incr) +static inline int __sem_update_count(struct compat_semaphore *sem, int incr) { int old_count, tmp; @@ -67,7 +67,7 @@ static inline int __sem_update_count(str : "=&r" (old_count), "=&r" (tmp), "=m" (sem->count) : "r" (incr), "m" (sem->count)); } else { - static DEFINE_SPINLOCK(semaphore_lock); + static DEFINE_RAW_SPINLOCK(semaphore_lock); unsigned long flags; spin_lock_irqsave(&semaphore_lock, flags); @@ -80,7 +80,7 @@ static inline int __sem_update_count(str return old_count; } -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { /* * Note that we incremented count in up() before we came here, @@ -94,7 +94,7 @@ void __up(struct semaphore *sem) wake_up(&sem->wait); } -EXPORT_SYMBOL(__up); +EXPORT_SYMBOL(__compat_up); /* * Note that when we come in to __down or __down_interruptible, @@ -104,7 +104,7 @@ EXPORT_SYMBOL(__up); * Thus it is only when we decrement count from some value > 0 * that we have actually got the semaphore. */ -void __sched __down(struct semaphore *sem) +void __sched __compat_down(struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -133,9 +133,9 @@ void __sched __down(struct semaphore *se wake_up(&sem->wait); } -EXPORT_SYMBOL(__down); +EXPORT_SYMBOL(__compat_down); -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -165,4 +165,10 @@ int __sched __down_interruptible(struct return retval; } -EXPORT_SYMBOL(__down_interruptible); +EXPORT_SYMBOL(__compat_down_interruptible); + +int fastcall compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} +EXPORT_SYMBOL(compat_sem_is_locked); Index: linux/arch/mips/kernel/signal.c =================================================================== --- linux.orig/arch/mips/kernel/signal.c +++ linux/arch/mips/kernel/signal.c @@ -416,6 +416,10 @@ void do_signal(struct pt_regs *regs) siginfo_t info; int signr; +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which is why we may in certain * cases get here from kernel mode. Just return without doing anything Index: linux/arch/mips/kernel/signal32.c =================================================================== --- linux.orig/arch/mips/kernel/signal32.c +++ linux/arch/mips/kernel/signal32.c @@ -807,6 +807,10 @@ void do_signal32(struct pt_regs *regs) siginfo_t info; int signr; +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which is why we may in certain * cases get here from kernel mode. Just return without doing anything Index: linux/arch/mips/kernel/smp.c =================================================================== --- linux.orig/arch/mips/kernel/smp.c +++ linux/arch/mips/kernel/smp.c @@ -115,7 +115,22 @@ asmlinkage void start_secondary(void) cpu_idle(); } -DEFINE_SPINLOCK(smp_call_lock); +DEFINE_RAW_SPINLOCK(smp_call_lock); + +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them. + */ +void smp_send_reschedule_allbutself(void) +{ + int cpu = smp_processor_id(); + int i; + + for (i = 0; i < NR_CPUS; i++) + if (cpu_online(i) && i != cpu) + core_send_ipi(i, SMP_RESCHEDULE_YOURSELF); +} struct call_data_struct *call_data; @@ -303,6 +318,8 @@ int setup_profiling_timer(unsigned int m return 0; } +static DEFINE_RAW_SPINLOCK(tlbstate_lock); + static void flush_tlb_all_ipi(void *info) { local_flush_tlb_all(); @@ -360,6 +377,7 @@ static inline void smp_on_each_tlb(void void flush_tlb_mm(struct mm_struct *mm) { preempt_disable(); + spin_lock(&tlbstate_lock); if ((atomic_read(&mm->mm_users) != 1) || (current->mm != mm)) { smp_on_other_tlbs(flush_tlb_mm_ipi, (void *)mm); @@ -369,6 +387,7 @@ void flush_tlb_mm(struct mm_struct *mm) if (smp_processor_id() != i) cpu_context(i, mm) = 0; } + spin_unlock(&tlbstate_lock); local_flush_tlb_mm(mm); preempt_enable(); @@ -392,6 +411,8 @@ void flush_tlb_range(struct vm_area_stru struct mm_struct *mm = vma->vm_mm; preempt_disable(); + spin_lock(&tlbstate_lock); + if ((atomic_read(&mm->mm_users) != 1) || (current->mm != mm)) { struct flush_tlb_data fd; @@ -405,6 +426,7 @@ void flush_tlb_range(struct vm_area_stru if (smp_processor_id() != i) cpu_context(i, mm) = 0; } + spin_unlock(&tlbstate_lock); local_flush_tlb_range(vma, start, end); preempt_enable(); } @@ -435,6 +457,8 @@ static void flush_tlb_page_ipi(void *inf void flush_tlb_page(struct vm_area_struct *vma, unsigned long page) { preempt_disable(); + spin_lock(&tlbstate_lock); + if ((atomic_read(&vma->vm_mm->mm_users) != 1) || (current->mm != vma->vm_mm)) { struct flush_tlb_data fd; @@ -447,6 +471,7 @@ void flush_tlb_page(struct vm_area_struc if (smp_processor_id() != i) cpu_context(i, vma->vm_mm) = 0; } + spin_unlock(&tlbstate_lock); local_flush_tlb_page(vma, page); preempt_enable(); } Index: linux/arch/mips/kernel/time.c =================================================================== --- linux.orig/arch/mips/kernel/time.c +++ linux/arch/mips/kernel/time.c @@ -10,6 +10,11 @@ * under the terms of the GNU General Public License as published by the * Free Software Foundation; either version 2 of the License, or (at your * option) any later version. + * + * This implementation of High Res Timers uses two timers. One is the system + * timer. The second is used for the high res timers. The high res timers + * require the CPU to have count/compare registers. The mips_set_next_event() + * function schedules the next high res timer interrupt. */ #include #include @@ -24,6 +29,7 @@ #include #include #include +#include #include #include @@ -48,7 +54,27 @@ /* * forward reference */ -DEFINE_SPINLOCK(rtc_lock); +DEFINE_RAW_SPINLOCK(rtc_lock); + +/* any missed timer interrupts */ +int missed_timer_count; + +#ifdef CONFIG_HIGH_RES_TIMERS +static void mips_set_next_event(unsigned long evt); +static void mips_set_mode(int mode, void *priv); + +static struct clock_event lapic_clockevent = { + .name = "mips clockevent interface", + .capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE | + CLOCK_HAS_IRQHANDLER +#ifdef CONFIG_SMP + | CLOCK_CAP_UPDATE +#endif + , + .shift = 32, + .set_next_event = mips_set_next_event, +}; +#endif /* * By default we provide the null RTC ops @@ -67,19 +93,30 @@ unsigned long (*rtc_mips_get_time)(void) int (*rtc_mips_set_time)(unsigned long) = null_rtc_set_time; int (*rtc_mips_set_mmss)(unsigned long); - /* how many counter cycles in a jiffy */ static unsigned long cycles_per_jiffy __read_mostly; +static unsigned long hrt_cycles_per_jiffy __read_mostly; + + /* expirelo is the count value for next CPU timer interrupt */ static unsigned int expirelo; - /* * Null timer ack for systems not needing one (e.g. i8254). */ static void null_timer_ack(void) { /* nothing */ } +#ifdef CONFIG_HIGH_RES_TIMERS +/* + * Set the next event + */ +static void mips_set_next_event(unsigned long evt) +{ + write_c0_compare(read_c0_count() + evt); +} +#endif + /* * Null high precision timer functions for systems lacking one. */ @@ -92,8 +129,6 @@ static void __init null_hpt_init(void) { /* nothing */ } - - /* * Timer ack for an R4k-compatible timer of a known frequency. */ @@ -103,14 +138,14 @@ static void c0_timer_ack(void) #ifndef CONFIG_SOC_PNX8550 /* pnx8550 resets to zero */ /* Ack this timer interrupt and set the next one. */ - expirelo += cycles_per_jiffy; + expirelo += hrt_cycles_per_jiffy; #endif write_c0_compare(expirelo); - /* Check to see if we have missed any timer interrupts. */ - while (((count = read_c0_count()) - expirelo) < 0x7fffffff) { - /* missed_timer_count++; */ - expirelo = count + cycles_per_jiffy; + count = read_c0_count(); + if ((count - expirelo) < 0x7fffffff) { + /* missed_timer_count++; */ + expirelo = count + hrt_cycles_per_jiffy; write_c0_compare(expirelo); } } @@ -139,6 +174,29 @@ unsigned int mips_hpt_mask = 0xffffffff; /* last time when xtime and rtc are sync'ed up */ static long last_rtc_update; +unsigned long read_persistent_clock(void) +{ + unsigned long sec; + sec = rtc_mips_get_time(); + return sec; +} + +void sync_persistent_clock(struct timespec ts) +{ + if (ntp_synced() && + xtime.tv_sec > last_rtc_update + 660 && + (xtime.tv_nsec / 1000) >= 500000 - ((unsigned) TICK_SIZE) / 2 && + (xtime.tv_nsec / 1000) <= 500000 + ((unsigned) TICK_SIZE) / 2) { + if (rtc_mips_set_mmss(xtime.tv_sec) == 0) { + last_rtc_update = xtime.tv_sec; + } + else { + /* do it again in 60 s */ + last_rtc_update = xtime.tv_sec - 600; + } + } +} + /* * local_timer_interrupt() does profiling and process accounting * on a per-CPU basis. @@ -172,7 +230,7 @@ irqreturn_t timer_interrupt(int irq, voi /* * If we have an externally synchronized Linux clock, then update - * CMOS clock accordingly every ~11 minutes. rtc_mips_set_time() has to be + * CMOS clock accordingly every ~11 minutes. rtc_set_time() has to be * called as close as possible to 500 ms before the new second starts. */ if (ntp_synced() && @@ -211,6 +269,15 @@ int (*perf_irq)(void) = null_perf_irq; EXPORT_SYMBOL(null_perf_irq); EXPORT_SYMBOL(perf_irq); +#ifdef CONFIG_HIGH_RES_TIMERS +void event_timer_handler(struct pt_regs *regs) +{ + c0_timer_ack(); + if (lapic_clockevent.event_handler) + lapic_clockevent.event_handler(regs,NULL); +} +#endif + asmlinkage void ll_timer_interrupt(int irq) { int r2 = cpu_has_mips_r2; @@ -224,6 +291,15 @@ asmlinkage void ll_timer_interrupt(int i * performance counter interrupt was pending, so we have to run the * performance counter interrupt handler anyway. */ +#ifdef CONFIG_HIGH_RES_TIMERS + /* + * Run the event handler + */ + if (!r2 || (read_c0_cause() & (1 << 26))) + if (lapic_clockevent.event_handler) + lapic_clockevent.event_handler(regs,NULL); +#endif + if (!r2 || (read_c0_cause() & (1 << 26))) if (perf_irq()) goto out; @@ -256,7 +332,7 @@ asmlinkage void ll_local_timer_interrupt * b) (optional) calibrate and set the mips_hpt_frequency * (only needed if you intended to use cpu counter as timer interrupt * source) - * 2) setup xtime based on rtc_mips_get_time(). + * 2) setup xtime based on rtc_get_time(). * 3) calculate a couple of cached variables for later usage * 4) plat_timer_setup() - * a) (optional) over-write any choices made above by time_init(). @@ -270,7 +346,7 @@ unsigned int mips_hpt_frequency; static struct irqaction timer_irqaction = { .handler = timer_interrupt, - .flags = IRQF_DISABLED, + .flags = IRQF_NODELAY | IRQF_DISABLED, .name = "timer", }; @@ -354,6 +430,9 @@ static void __init init_mips_clocksource void __init time_init(void) { +#ifdef CONFIG_HIGH_RES_TIMERS + u64 temp; +#endif if (board_time_init) board_time_init(); @@ -393,6 +472,12 @@ void __init time_init(void) /* Calculate cache parameters. */ cycles_per_jiffy = (mips_hpt_frequency + HZ / 2) / HZ; +#ifdef CONFIG_HIGH_RES_TIMERS + hrt_cycles_per_jiffy = ( (CONFIG_CPU_SPEED * 1000000) + HZ / 2) / HZ; +#else + hrt_cycles_per_jiffy = cycles_per_jiffy; +#endif + /* Report the high precision timer rate for a reference. */ printk("Using %u.%03u MHz high precision timer.\n", ((mips_hpt_frequency + 500) / 1000) / 1000, @@ -478,3 +563,128 @@ unsigned long long sched_clock(void) { return (unsigned long long)jiffies*(1000000000/HZ); } + + +#ifdef CONFIG_SMP +/* + * We have to synchronize the master CPU with all the slave CPUs + */ +static atomic_t cpus_started; +static atomic_t cpus_ready; +static atomic_t cpus_count; +/* + * Master processor inits + */ +static void sync_cpus_init(int v) +{ + atomic_set(&cpus_count, 0); + mb(); + atomic_set(&cpus_started, v); + mb(); + atomic_set(&cpus_ready, v); + mb(); +} + +/* + * Called by the master processor + */ +static void sync_cpus_master(int v) +{ + atomic_set(&cpus_count, 0); + mb(); + atomic_set(&cpus_started, v); + mb(); + /* Wait here till all other CPUs are now ready */ + while (atomic_read(&cpus_count) != (num_online_cpus() -1) ) + mb(); + atomic_set(&cpus_ready, v); + mb(); +} +/* + * Called by the slave processors + */ +static void sync_cpus_slave(int v) +{ + /* Check if the master has been through this */ + while (atomic_read(&cpus_started) != v) + mb(); + atomic_inc(&cpus_count); + mb(); + while (atomic_read(&cpus_ready) != v) + mb(); +} +/* + * Called by the slave CPUs when done syncing the count register + * with the master processor + */ +static void sync_cpus_slave_exit(int v) +{ + while (atomic_read(&cpus_started) != v) + mb(); + atomic_inc(&cpus_count); + mb(); +} + +#define LOOPS 100 +static u32 c0_count[NR_CPUS]; /* Count register per CPU */ +static u32 c[NR_CPUS][LOOPS + 1]; /* Count register per CPU per loop for syncing */ + +/* + * Slave processors execute this via IPI + */ +static void sync_c0_count_slave(void *info) +{ + int cpus = 1, loop, prev_count = 0, cpu = smp_processor_id(); + unsigned long flags; + u32 diff_count; /* CPU count registers are 32-bit */ + local_irq_save(flags); + + for(loop = 0; loop <= LOOPS; loop++) { + /* Sync with the Master processor */ + sync_cpus_slave(cpus++); + c[cpu][loop] = c0_count[cpu] = read_c0_count(); + mb(); + sync_cpus_slave(cpus++); + diff_count = c0_count[0] - c0_count[cpu]; + diff_count += prev_count; + diff_count += read_c0_count(); + write_c0_count(diff_count); + prev_count = (prev_count >> 1) + + ((int)(c0_count[0] - c0_count[cpu]) >> 1); + } + + /* Slave processor is done syncing count register with Master */ + sync_cpus_slave_exit(cpus++); + printk("SMP: Slave processor %d done syncing count \n", cpu); + local_irq_restore(flags); +} + +/* + * Master kicks off the syncing process + */ +void sync_c0_count_master(void) +{ + int cpus = 0, loop, cpu = smp_processor_id(); + unsigned long flags; + + printk("SMP: Starting to sync the c0 count register ... \n"); + sync_cpus_init(cpus++); + + /* Kick off the slave processors to also start the syncing process */ + smp_call_function(sync_c0_count_slave, NULL, 0, 0); + local_irq_save(flags); + + for (loop = 0; loop <= LOOPS; loop++) { + /* Wait for all the CPUs here */ + sync_cpus_master(cpus++); + c[cpu][loop] = c0_count[cpu] = read_c0_count(); + mb(); + /* Do syncing once more */ + sync_cpus_master(cpus++); + } + sync_cpus_master(cpus++); + local_irq_restore(flags); + + printk("SMP: Syncing process completed accross CPUs ... \n"); +} +#endif /* CONFIG_SMP */ Index: linux/arch/mips/kernel/traps.c =================================================================== --- linux.orig/arch/mips/kernel/traps.c +++ linux/arch/mips/kernel/traps.c @@ -304,7 +304,7 @@ void show_registers(struct pt_regs *regs printk("\n"); } -static DEFINE_SPINLOCK(die_lock); +static DEFINE_RAW_SPINLOCK(die_lock); NORET_TYPE void ATTRIB_NORET die(const char * str, struct pt_regs * regs) { Index: linux/arch/mips/mm/init.c =================================================================== --- linux.orig/arch/mips/mm/init.c +++ linux/arch/mips/mm/init.c @@ -59,7 +59,7 @@ #endif /* CONFIG_MIPS_MT_SMTC */ -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long highstart_pfn, highend_pfn; Index: linux/arch/mips/sibyte/cfe/smp.c =================================================================== --- linux.orig/arch/mips/sibyte/cfe/smp.c +++ linux/arch/mips/sibyte/cfe/smp.c @@ -107,4 +107,8 @@ void prom_smp_finish(void) */ void prom_cpus_done(void) { +#ifdef CONFIG_HIGH_RES_TIMERS + extern void sync_c0_count_master(void); + sync_c0_count_master(); +#endif } Index: linux/arch/mips/sibyte/sb1250/irq.c =================================================================== --- linux.orig/arch/mips/sibyte/sb1250/irq.c +++ linux/arch/mips/sibyte/sb1250/irq.c @@ -84,7 +84,7 @@ static struct irq_chip sb1250_irq_type = /* Store the CPU id (not the logical number) */ int sb1250_irq_owner[SB1250_NR_IRQS]; -DEFINE_SPINLOCK(sb1250_imr_lock); +DEFINE_RAW_SPINLOCK(sb1250_imr_lock); void sb1250_mask_irq(int cpu, int irq) { @@ -260,7 +260,7 @@ static irqreturn_t sb1250_dummy_handler static struct irqaction sb1250_dummy_action = { .handler = sb1250_dummy_handler, - .flags = 0, + .flags = IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "sb1250-private", .next = NULL, @@ -370,6 +370,10 @@ void __init arch_init_irq(void) #ifdef CONFIG_KGDB imask |= STATUSF_IP6; #endif + +#ifdef CONFIG_HIGH_RES_TIMERS + imask |= STATUSF_IP7; +#endif /* Enable necessary IPs, disable the rest */ change_c0_status(ST0_IM, imask); @@ -447,6 +451,10 @@ asmlinkage void plat_irq_dispatch(void) else #endif +#ifdef CONFIG_HIGH_RES_TIMERS + if (pending & CAUSEF_IP7) + event_timer_handler(regs); +#endif if (pending & CAUSEF_IP4) sb1250_timer_interrupt(); Index: linux/arch/mips/sibyte/sb1250/smp.c =================================================================== --- linux.orig/arch/mips/sibyte/sb1250/smp.c +++ linux/arch/mips/sibyte/sb1250/smp.c @@ -59,7 +59,7 @@ void sb1250_smp_finish(void) { extern void sb1250_time_init(void); sb1250_time_init(); - local_irq_enable(); + raw_local_irq_enable(); } /* Index: linux/arch/mips/sibyte/swarm/setup.c =================================================================== --- linux.orig/arch/mips/sibyte/swarm/setup.c +++ linux/arch/mips/sibyte/swarm/setup.c @@ -131,6 +131,12 @@ void __init plat_mem_setup(void) rtc_mips_set_time = m41t81_set_time; } +#ifdef CONFIG_HIGH_RES_TIMERS + /* + * set the mips_hpt_frequency here + */ + mips_hpt_frequency = CONFIG_CPU_SPEED * 1000000; +#endif printk("This kernel optimized for " #ifdef CONFIG_SIMULATION "simulation" Index: linux/arch/powerpc/Kconfig =================================================================== --- linux.orig/arch/powerpc/Kconfig +++ linux/arch/powerpc/Kconfig @@ -26,18 +26,15 @@ config MMU bool default y -config GENERIC_HARDIRQS +config GENERIC_TIME bool default y -config IRQ_PER_CPU +config GENERIC_HARDIRQS bool default y -config RWSEM_GENERIC_SPINLOCK - bool - -config RWSEM_XCHGADD_ALGORITHM +config IRQ_PER_CPU bool default y @@ -619,6 +616,18 @@ config HIGHMEM source kernel/Kconfig.hz source kernel/Kconfig.preempt + +config RWSEM_GENERIC_SPINLOCK + bool + default y + +config ASM_SEMAPHORES + bool + default y + +config RWSEM_XCHGADD_ALGORITHM + bool + source "fs/Kconfig.binfmt" # We optimistically allocate largepages from the VM, so make the limit Index: linux/arch/powerpc/Kconfig.debug =================================================================== --- linux.orig/arch/powerpc/Kconfig.debug +++ linux/arch/powerpc/Kconfig.debug @@ -2,6 +2,10 @@ menu "Kernel hacking" source "lib/Kconfig.debug" +config TRACE_IRQFLAGS_SUPPORT + bool + default y + config DEBUG_STACKOVERFLOW bool "Check for stack overflows" depends on DEBUG_KERNEL && PPC64 Index: linux/arch/powerpc/boot/Makefile =================================================================== --- linux.orig/arch/powerpc/boot/Makefile +++ linux/arch/powerpc/boot/Makefile @@ -33,6 +33,14 @@ endif BOOTCFLAGS += -I$(obj) -I$(srctree)/$(obj) +ifdef CONFIG_MCOUNT +# do not trace the boot loader +nullstring := +space := $(nullstring) # end of the line +pg_flag = $(nullstring) -pg # end of the line +CFLAGS := $(subst ${pg_flag},${space},${CFLAGS}) +endif + zlib := inffast.c inflate.c inftrees.c zlibheader := inffast.h inffixed.h inflate.h inftrees.h infutil.h zliblinuxheader := zlib.h zconf.h zutil.h @@ -50,7 +58,7 @@ obj-wlib := $(addsuffix .o, $(basename $ obj-plat := $(addsuffix .o, $(basename $(addprefix $(obj)/, $(src-plat)))) quiet_cmd_copy_zlib = COPY $@ - cmd_copy_zlib = sed "s@__attribute_used__@@;s@]\+\).*@\"\1\"@" $< > $@ + cmd_copy_zlib = sed "s@__attribute_used__@@;s@.include.@@;s@.include.@@;s@.*spin.*lock.*@@;s@.*SPINLOCK.*@@;s@]\+\).*@\"\1\"@" $< > $@ quiet_cmd_copy_zlibheader = COPY $@ cmd_copy_zlibheader = sed "s@]\+\).*@\"\1\"@" $< > $@ Index: linux/arch/powerpc/kernel/Makefile =================================================================== --- linux.orig/arch/powerpc/kernel/Makefile +++ linux/arch/powerpc/kernel/Makefile @@ -10,10 +10,11 @@ CFLAGS_prom_init.o += -fPIC CFLAGS_btext.o += -fPIC endif -obj-y := semaphore.o cputable.o ptrace.o syscalls.o \ +obj-y := cputable.o ptrace.o syscalls.o \ irq.o align.o signal_32.o pmc.o vdso.o \ init_task.o process.o systbl.o idle.o obj-y += vdso32/ +obj-$(CONFIG_ASM_SEMAPHORES) += semaphore.o obj-$(CONFIG_PPC64) += setup_64.o binfmt_elf32.o sys_ppc32.o \ signal_64.o ptrace32.o \ paca.o cpu_setup_ppc970.o \ Index: linux/arch/powerpc/kernel/entry_32.S =================================================================== --- linux.orig/arch/powerpc/kernel/entry_32.S +++ linux/arch/powerpc/kernel/entry_32.S @@ -638,7 +638,7 @@ user_exc_return: /* r10 contains MSR_KE /* Check current_thread_info()->flags */ rlwinm r9,r1,0,0,(31-THREAD_SHIFT) lwz r9,TI_FLAGS(r9) - andi. r0,r9,(_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK|_TIF_NEED_RESCHED) + andi. r0,r9,(_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK|_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne do_work restore_user: @@ -856,7 +856,7 @@ load_dbcr0: #endif /* !(CONFIG_4xx || CONFIG_BOOKE) */ do_work: /* r10 contains MSR_KERNEL here */ - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beq do_user_signal do_resched: /* r10 contains MSR_KERNEL here */ @@ -870,7 +870,7 @@ recheck: MTMSRD(r10) /* disable interrupts */ rlwinm r9,r1,0,0,(31-THREAD_SHIFT) lwz r9,TI_FLAGS(r9) - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne- do_resched andi. r0,r9,_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK beq restore_user @@ -978,3 +978,85 @@ machine_check_in_rtas: /* XXX load up BATs and panic */ #endif /* CONFIG_PPC_RTAS */ + +#ifdef CONFIG_MCOUNT +/* + * mcount() is not the same as _mcount(). The callers of mcount() have a + * normal context. The callers of _mcount() do not have a stack frame and + * have not saved the "caller saves" registers. + */ +_GLOBAL(mcount) + stwu r1,-16(r1) + mflr r3 + lis r5,mcount_enabled@ha + lwz r5,mcount_enabled@l(r5) + stw r3,20(r1) + cmpwi r5,0 + beq 1f + /* r3 contains lr (eip), put parent lr (parent_eip) in r4 */ + lwz r4,16(r1) + lwz r4,4(r4) + bl __trace +1: + lwz r0,20(r1) + mtlr r0 + addi r1,r1,16 + blr + +/* + * The -pg flag, which is specified in the case of CONFIG_MCOUNT, causes the + * C compiler to add a call to _mcount() at the start of each function + * preamble, before the stack frame is created. An example of this preamble + * code is: + * + * mflr r0 + * lis r12,-16354 + * stw r0,4(r1) + * addi r0,r12,-19652 + * bl 0xc00034c8 <_mcount> + * mflr r0 + * stwu r1,-16(r1) + */ +_GLOBAL(_mcount) +#define M_STK_SIZE 48 + /* Would not expect to need to save cr, but glibc version of */ + /* _mcount() does, so cautiously saving it here too. */ + stwu r1,-M_STK_SIZE(r1) + stw r3, 12(r1) + stw r4, 16(r1) + stw r5, 20(r1) + stw r6, 24(r1) + mflr r3 /* will use as first arg to __trace() */ + mfcr r4 + lis r5,mcount_enabled@ha + lwz r5,mcount_enabled@l(r5) + cmpwi r5,0 + stw r3, 44(r1) /* lr */ + stw r4, 8(r1) /* cr */ + stw r7, 28(r1) + stw r8, 32(r1) + stw r9, 36(r1) + stw r10,40(r1) + beq 1f + /* r3 contains lr (eip), put parent lr (parent_eip) in r4 */ + lwz r4,M_STK_SIZE+4(r1) + bl __trace +1: + lwz r8, 8(r1) /* cr */ + lwz r9, 44(r1) /* lr */ + lwz r3, 12(r1) + lwz r4, 16(r1) + lwz r5, 20(r1) + mtcrf 0xff,r8 + mtctr r9 + lwz r0, 52(r1) + lwz r6, 24(r1) + lwz r7, 28(r1) + lwz r8, 32(r1) + lwz r9, 36(r1) + lwz r10,40(r1) + addi r1,r1,M_STK_SIZE + mtlr r0 + bctr + +#endif /* CONFIG_MCOUNT */ Index: linux/arch/powerpc/kernel/idle.c =================================================================== --- linux.orig/arch/powerpc/kernel/idle.c +++ linux/arch/powerpc/kernel/idle.c @@ -82,7 +82,7 @@ void cpu_idle(void) ppc64_runlatch_on(); if (cpu_should_die()) cpu_die(); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); schedule(); preempt_disable(); } Index: linux/arch/powerpc/kernel/irq.c =================================================================== --- linux.orig/arch/powerpc/kernel/irq.c +++ linux/arch/powerpc/kernel/irq.c @@ -92,8 +92,6 @@ extern atomic_t ipi_sent; #endif #ifdef CONFIG_PPC64 -EXPORT_SYMBOL(irq_desc); - int distribute_irqs = 1; #endif /* CONFIG_PPC64 */ Index: linux/arch/powerpc/kernel/ppc_ksyms.c =================================================================== --- linux.orig/arch/powerpc/kernel/ppc_ksyms.c +++ linux/arch/powerpc/kernel/ppc_ksyms.c @@ -16,7 +16,6 @@ #include #include -#include #include #include #include @@ -174,7 +173,6 @@ EXPORT_SYMBOL(screen_info); #ifdef CONFIG_PPC32 EXPORT_SYMBOL(timer_interrupt); -EXPORT_SYMBOL(irq_desc); EXPORT_SYMBOL(tb_ticks_per_jiffy); EXPORT_SYMBOL(console_drivers); EXPORT_SYMBOL(cacheable_memcpy); Index: linux/arch/powerpc/kernel/semaphore.c =================================================================== --- linux.orig/arch/powerpc/kernel/semaphore.c +++ linux/arch/powerpc/kernel/semaphore.c @@ -31,7 +31,7 @@ * sem->count = tmp; * return old_count; */ -static inline int __sem_update_count(struct semaphore *sem, int incr) +static inline int __sem_update_count(struct compat_semaphore *sem, int incr) { int old_count, tmp; @@ -50,7 +50,7 @@ static inline int __sem_update_count(str return old_count; } -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { /* * Note that we incremented count in up() before we came here, @@ -63,7 +63,7 @@ void __up(struct semaphore *sem) __sem_update_count(sem, 1); wake_up(&sem->wait); } -EXPORT_SYMBOL(__up); +EXPORT_SYMBOL(__compat_up); /* * Note that when we come in to __down or __down_interruptible, @@ -73,7 +73,7 @@ EXPORT_SYMBOL(__up); * Thus it is only when we decrement count from some value > 0 * that we have actually got the semaphore. */ -void __sched __down(struct semaphore *sem) +void __sched __compat_down(struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -101,9 +101,9 @@ void __sched __down(struct semaphore *se */ wake_up(&sem->wait); } -EXPORT_SYMBOL(__down); +EXPORT_SYMBOL(__compat_down); -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore *sem) { int retval = 0; struct task_struct *tsk = current; @@ -132,4 +132,10 @@ int __sched __down_interruptible(struct wake_up(&sem->wait); return retval; } -EXPORT_SYMBOL(__down_interruptible); +EXPORT_SYMBOL(__compat_down_interruptible); + +int compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} +EXPORT_SYMBOL(compat_sem_is_locked); Index: linux/arch/powerpc/kernel/smp.c =================================================================== --- linux.orig/arch/powerpc/kernel/smp.c +++ linux/arch/powerpc/kernel/smp.c @@ -148,6 +148,16 @@ void smp_send_reschedule(int cpu) smp_ops->message_pass(cpu, PPC_MSG_RESCHEDULE); } +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + smp_ops->message_pass(MSG_ALL_BUT_SELF, PPC_MSG_RESCHEDULE); +} + #ifdef CONFIG_DEBUGGER void smp_send_debugger_break(int cpu) { @@ -184,7 +194,7 @@ void smp_send_stop(void) * static memory requirements. It also looks cleaner. * Stolen from the i386 version. */ -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_lock); +static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(call_lock); static struct call_data_struct { void (*func) (void *info); Index: linux/arch/powerpc/kernel/time.c =================================================================== --- linux.orig/arch/powerpc/kernel/time.c +++ linux/arch/powerpc/kernel/time.c @@ -74,6 +74,9 @@ #endif #include +unsigned long cpu_khz; /* Detected as we calibrate the TSC */ +EXPORT_SYMBOL(cpu_khz); + /* keep track of when we need to update the rtc */ time_t last_rtc_update; #ifdef CONFIG_PPC_ISERIES @@ -116,8 +119,6 @@ EXPORT_SYMBOL_GPL(rtc_lock); u64 tb_to_ns_scale; unsigned tb_to_ns_shift; -struct gettimeofday_struct do_gtod; - extern struct timezone sys_tz; static long timezone_offset; @@ -375,162 +376,8 @@ static __inline__ void timer_check_rtc(v } } -/* - * This version of gettimeofday has microsecond resolution. - */ -static inline void __do_gettimeofday(struct timeval *tv) -{ - unsigned long sec, usec; - u64 tb_ticks, xsec; - struct gettimeofday_vars *temp_varp; - u64 temp_tb_to_xs, temp_stamp_xsec; - - /* - * These calculations are faster (gets rid of divides) - * if done in units of 1/2^20 rather than microseconds. - * The conversion to microseconds at the end is done - * without a divide (and in fact, without a multiply) - */ - temp_varp = do_gtod.varp; - - /* Sampling the time base must be done after loading - * do_gtod.varp in order to avoid racing with update_gtod. - */ - data_barrier(temp_varp); - tb_ticks = get_tb() - temp_varp->tb_orig_stamp; - temp_tb_to_xs = temp_varp->tb_to_xs; - temp_stamp_xsec = temp_varp->stamp_xsec; - xsec = temp_stamp_xsec + mulhdu(tb_ticks, temp_tb_to_xs); - sec = xsec / XSEC_PER_SEC; - usec = (unsigned long)xsec & (XSEC_PER_SEC - 1); - usec = SCALE_XSEC(usec, 1000000); - - tv->tv_sec = sec; - tv->tv_usec = usec; -} - -void do_gettimeofday(struct timeval *tv) -{ - if (__USE_RTC()) { - /* do this the old way */ - unsigned long flags, seq; - unsigned int sec, nsec, usec; - - do { - seq = read_seqbegin_irqsave(&xtime_lock, flags); - sec = xtime.tv_sec; - nsec = xtime.tv_nsec + tb_ticks_since(tb_last_jiffy); - } while (read_seqretry_irqrestore(&xtime_lock, seq, flags)); - usec = nsec / 1000; - while (usec >= 1000000) { - usec -= 1000000; - ++sec; - } - tv->tv_sec = sec; - tv->tv_usec = usec; - return; - } - __do_gettimeofday(tv); -} - -EXPORT_SYMBOL(do_gettimeofday); - -/* - * There are two copies of tb_to_xs and stamp_xsec so that no - * lock is needed to access and use these values in - * do_gettimeofday. We alternate the copies and as long as a - * reasonable time elapses between changes, there will never - * be inconsistent values. ntpd has a minimum of one minute - * between updates. - */ -static inline void update_gtod(u64 new_tb_stamp, u64 new_stamp_xsec, - u64 new_tb_to_xs) -{ - unsigned temp_idx; - struct gettimeofday_vars *temp_varp; - - temp_idx = (do_gtod.var_idx == 0); - temp_varp = &do_gtod.vars[temp_idx]; - - temp_varp->tb_to_xs = new_tb_to_xs; - temp_varp->tb_orig_stamp = new_tb_stamp; - temp_varp->stamp_xsec = new_stamp_xsec; - smp_mb(); - do_gtod.varp = temp_varp; - do_gtod.var_idx = temp_idx; - - /* - * tb_update_count is used to allow the userspace gettimeofday code - * to assure itself that it sees a consistent view of the tb_to_xs and - * stamp_xsec variables. It reads the tb_update_count, then reads - * tb_to_xs and stamp_xsec and then reads tb_update_count again. If - * the two values of tb_update_count match and are even then the - * tb_to_xs and stamp_xsec values are consistent. If not, then it - * loops back and reads them again until this criteria is met. - * We expect the caller to have done the first increment of - * vdso_data->tb_update_count already. - */ - vdso_data->tb_orig_stamp = new_tb_stamp; - vdso_data->stamp_xsec = new_stamp_xsec; - vdso_data->tb_to_xs = new_tb_to_xs; - vdso_data->wtom_clock_sec = wall_to_monotonic.tv_sec; - vdso_data->wtom_clock_nsec = wall_to_monotonic.tv_nsec; - smp_wmb(); - ++(vdso_data->tb_update_count); -} - -/* - * When the timebase - tb_orig_stamp gets too big, we do a manipulation - * between tb_orig_stamp and stamp_xsec. The goal here is to keep the - * difference tb - tb_orig_stamp small enough to always fit inside a - * 32 bits number. This is a requirement of our fast 32 bits userland - * implementation in the vdso. If we "miss" a call to this function - * (interrupt latency, CPU locked in a spinlock, ...) and we end up - * with a too big difference, then the vdso will fallback to calling - * the syscall - */ -static __inline__ void timer_recalc_offset(u64 cur_tb) -{ - unsigned long offset; - u64 new_stamp_xsec; - u64 tlen, t2x; - u64 tb, xsec_old, xsec_new; - struct gettimeofday_vars *varp; - - if (__USE_RTC()) - return; - tlen = current_tick_length(); - offset = cur_tb - do_gtod.varp->tb_orig_stamp; - if (tlen == last_tick_len && offset < 0x80000000u) - return; - if (tlen != last_tick_len) { - t2x = mulhdu(tlen << TICKLEN_SHIFT, ticklen_to_xs); - last_tick_len = tlen; - } else - t2x = do_gtod.varp->tb_to_xs; - new_stamp_xsec = (u64) xtime.tv_nsec * XSEC_PER_SEC; - do_div(new_stamp_xsec, 1000000000); - new_stamp_xsec += (u64) xtime.tv_sec * XSEC_PER_SEC; - - ++vdso_data->tb_update_count; - smp_mb(); - - /* - * Make sure time doesn't go backwards for userspace gettimeofday. - */ - tb = get_tb(); - varp = do_gtod.varp; - xsec_old = mulhdu(tb - varp->tb_orig_stamp, varp->tb_to_xs) - + varp->stamp_xsec; - xsec_new = mulhdu(tb - cur_tb, t2x) + new_stamp_xsec; - if (xsec_new < xsec_old) - new_stamp_xsec += xsec_old - xsec_new; - - update_gtod(cur_tb, new_stamp_xsec, t2x); -} - #ifdef CONFIG_SMP -unsigned long profile_pc(struct pt_regs *regs) +unsigned long notrace profile_pc(struct pt_regs *regs) { unsigned long pc = instruction_pointer(regs); @@ -578,11 +425,7 @@ static void iSeries_tb_recal(void) tb_ticks_per_sec = new_tb_ticks_per_sec; calc_cputime_factors(); div128_by_32( XSEC_PER_SEC, 0, tb_ticks_per_sec, &divres ); - do_gtod.tb_ticks_per_sec = tb_ticks_per_sec; tb_to_xs = divres.result_low; - do_gtod.varp->tb_to_xs = tb_to_xs; - vdso_data->tb_ticks_per_sec = tb_ticks_per_sec; - vdso_data->tb_to_xs = tb_to_xs; } else { printk( "Titan recalibrate: FAILED (difference > 4 percent)\n" @@ -664,7 +507,7 @@ void timer_interrupt(struct pt_regs * re if (per_cpu(last_jiffy, cpu) >= tb_next_jiffy) { tb_last_jiffy = tb_next_jiffy; do_timer(1); - timer_recalc_offset(tb_last_jiffy); + /*timer_recalc_offset(tb_last_jiffy);*/ timer_check_rtc(); } write_sequnlock(&xtime_lock); @@ -752,77 +595,6 @@ unsigned long long sched_clock(void) return mulhdu(get_tb(), tb_to_ns_scale) << tb_to_ns_shift; } -int do_settimeofday(struct timespec *tv) -{ - time_t wtm_sec, new_sec = tv->tv_sec; - long wtm_nsec, new_nsec = tv->tv_nsec; - unsigned long flags; - u64 new_xsec; - unsigned long tb_delta; - - if ((unsigned long)tv->tv_nsec >= NSEC_PER_SEC) - return -EINVAL; - - write_seqlock_irqsave(&xtime_lock, flags); - - /* - * Updating the RTC is not the job of this code. If the time is - * stepped under NTP, the RTC will be updated after STA_UNSYNC - * is cleared. Tools like clock/hwclock either copy the RTC - * to the system time, in which case there is no point in writing - * to the RTC again, or write to the RTC but then they don't call - * settimeofday to perform this operation. - */ -#ifdef CONFIG_PPC_ISERIES - if (first_settimeofday) { - iSeries_tb_recal(); - first_settimeofday = 0; - } -#endif - - /* Make userspace gettimeofday spin until we're done. */ - ++vdso_data->tb_update_count; - smp_mb(); - - /* - * Subtract off the number of nanoseconds since the - * beginning of the last tick. - */ - tb_delta = tb_ticks_since(tb_last_jiffy); - tb_delta = mulhdu(tb_delta, do_gtod.varp->tb_to_xs); /* in xsec */ - new_nsec -= SCALE_XSEC(tb_delta, 1000000000); - - wtm_sec = wall_to_monotonic.tv_sec + (xtime.tv_sec - new_sec); - wtm_nsec = wall_to_monotonic.tv_nsec + (xtime.tv_nsec - new_nsec); - - set_normalized_timespec(&xtime, new_sec, new_nsec); - set_normalized_timespec(&wall_to_monotonic, wtm_sec, wtm_nsec); - - /* In case of a large backwards jump in time with NTP, we want the - * clock to be updated as soon as the PLL is again in lock. - */ - last_rtc_update = new_sec - 658; - - ntp_clear(); - - new_xsec = xtime.tv_nsec; - if (new_xsec != 0) { - new_xsec *= XSEC_PER_SEC; - do_div(new_xsec, NSEC_PER_SEC); - } - new_xsec += (u64)xtime.tv_sec * XSEC_PER_SEC; - update_gtod(tb_last_jiffy, new_xsec, do_gtod.varp->tb_to_xs); - - vdso_data->tz_minuteswest = sys_tz.tz_minuteswest; - vdso_data->tz_dsttime = sys_tz.tz_dsttime; - - write_sequnlock_irqrestore(&xtime_lock, flags); - clock_was_set(); - return 0; -} - -EXPORT_SYMBOL(do_settimeofday); - static int __init get_freq(char *name, int cells, unsigned long *val) { struct device_node *cpu; @@ -920,6 +692,7 @@ void __init time_init(void) tb_ticks_per_jiffy = ppc_tb_freq / HZ; tb_ticks_per_sec = ppc_tb_freq; tb_ticks_per_usec = ppc_tb_freq / 1000000; + cpu_khz = ppc_tb_freq / 1000; tb_to_us = mulhwu_scale_factor(ppc_tb_freq, 1000000); calc_cputime_factors(); @@ -988,20 +761,6 @@ void __init time_init(void) xtime.tv_sec = tm; xtime.tv_nsec = 0; - do_gtod.varp = &do_gtod.vars[0]; - do_gtod.var_idx = 0; - do_gtod.varp->tb_orig_stamp = tb_last_jiffy; - __get_cpu_var(last_jiffy) = tb_last_jiffy; - do_gtod.varp->stamp_xsec = (u64) xtime.tv_sec * XSEC_PER_SEC; - do_gtod.tb_ticks_per_sec = tb_ticks_per_sec; - do_gtod.varp->tb_to_xs = tb_to_xs; - do_gtod.tb_to_us = tb_to_us; - - vdso_data->tb_orig_stamp = tb_last_jiffy; - vdso_data->tb_update_count = 0; - vdso_data->tb_ticks_per_sec = tb_ticks_per_sec; - vdso_data->stamp_xsec = (u64) xtime.tv_sec * XSEC_PER_SEC; - vdso_data->tb_to_xs = tb_to_xs; time_freq = 0; @@ -1014,7 +773,6 @@ void __init time_init(void) set_dec(tb_ticks_per_jiffy); } - #define FEBRUARY 2 #define STARTOFTIME 1970 #define SECDAY 86400L @@ -1159,3 +917,36 @@ void div128_by_32(u64 dividend_high, u64 dr->result_low = ((u64)y << 32) + z; } + +/* PowerPC clocksource code */ + +#include + +static cycle_t timebase_read(void) +{ + return (cycle_t)get_tb(); +} + +struct clocksource clocksource_timebase = { + .name = "timebase", + .rating = 200, + .read = timebase_read, + .mask = (cycle_t)-1, + .mult = 0, + .shift = 22, + .is_continuous = 1, +}; + + +/* XXX - this should be calculated or properly externed! */ +static int __init init_timebase_clocksource(void) +{ + if (__USE_RTC()) + return -ENODEV; + + clocksource_timebase.mult = clocksource_hz2mult(tb_ticks_per_sec, + clocksource_timebase.shift); + return clocksource_register(&clocksource_timebase); +} + +module_init(init_timebase_clocksource); Index: linux/arch/powerpc/kernel/traps.c =================================================================== --- linux.orig/arch/powerpc/kernel/traps.c +++ linux/arch/powerpc/kernel/traps.c @@ -93,7 +93,7 @@ EXPORT_SYMBOL(unregister_die_notifier); * Trap & Exception support */ -static DEFINE_SPINLOCK(die_lock); +static DEFINE_RAW_SPINLOCK(die_lock); int die(const char *str, struct pt_regs *regs, long err) { @@ -164,6 +164,11 @@ void _exception(int signr, struct pt_reg return; } +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif + memset(&info, 0, sizeof(info)); info.si_signo = signr; info.si_code = code; Index: linux/arch/powerpc/lib/locks.c =================================================================== --- linux.orig/arch/powerpc/lib/locks.c +++ linux/arch/powerpc/lib/locks.c @@ -25,7 +25,7 @@ #include #include -void __spin_yield(raw_spinlock_t *lock) +void __spin_yield(__raw_spinlock_t *lock) { unsigned int lock_value, holder_cpu, yield_count; @@ -78,7 +78,7 @@ void __rw_yield(raw_rwlock_t *rw) } #endif -void __raw_spin_unlock_wait(raw_spinlock_t *lock) +void __raw_spin_unlock_wait(__raw_spinlock_t *lock) { while (lock->slock) { HMT_low(); Index: linux/arch/powerpc/mm/fault.c =================================================================== --- linux.orig/arch/powerpc/mm/fault.c +++ linux/arch/powerpc/mm/fault.c @@ -149,8 +149,8 @@ static void do_dabr(struct pt_regs *regs * The return value is 0 if the fault was handled, or the signal * number if this is a kernel fault that can't be handled here. */ -int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, - unsigned long error_code) +int __kprobes notrace do_page_fault(struct pt_regs *regs, + unsigned long address, unsigned long error_code) { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; Index: linux/arch/powerpc/mm/init_32.c =================================================================== --- linux.orig/arch/powerpc/mm/init_32.c +++ linux/arch/powerpc/mm/init_32.c @@ -56,7 +56,7 @@ #endif #define MAX_LOW_MEM CONFIG_LOWMEM_SIZE -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long total_memory; unsigned long total_lowmem; Index: linux/arch/powerpc/mm/tlb_64.c =================================================================== --- linux.orig/arch/powerpc/mm/tlb_64.c +++ linux/arch/powerpc/mm/tlb_64.c @@ -37,7 +37,7 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p /* This is declared as we are using the more or less generic * include/asm-powerpc/tlb.h file -- tgall */ -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); unsigned long pte_freelist_forced_free; Index: linux/arch/powerpc/platforms/cell/smp.c =================================================================== --- linux.orig/arch/powerpc/platforms/cell/smp.c +++ linux/arch/powerpc/platforms/cell/smp.c @@ -133,7 +133,7 @@ static void __devinit smp_iic_setup_cpu( iic_setup_cpu(); } -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned long timebase = 0; static void __devinit cell_give_timebase(void) Index: linux/arch/powerpc/platforms/chrp/smp.c =================================================================== --- linux.orig/arch/powerpc/platforms/chrp/smp.c +++ linux/arch/powerpc/platforms/chrp/smp.c @@ -45,7 +45,7 @@ static void __devinit smp_chrp_setup_cpu mpic_setup_this_cpu(); } -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned int timebase_upper = 0, timebase_lower = 0; void __devinit smp_chrp_give_timebase(void) Index: linux/arch/powerpc/platforms/chrp/time.c =================================================================== --- linux.orig/arch/powerpc/platforms/chrp/time.c +++ linux/arch/powerpc/platforms/chrp/time.c @@ -27,7 +27,7 @@ #include #include -extern spinlock_t rtc_lock; +extern raw_spinlock_t rtc_lock; static int nvram_as1 = NVRAM_AS1; static int nvram_as0 = NVRAM_AS0; Index: linux/arch/powerpc/platforms/iseries/setup.c =================================================================== --- linux.orig/arch/powerpc/platforms/iseries/setup.c +++ linux/arch/powerpc/platforms/iseries/setup.c @@ -595,12 +595,14 @@ static void yield_shared_processor(void) static void iseries_shared_idle(void) { while (1) { - while (!need_resched() && !hvlpevent_is_pending()) { + while (!need_resched() && !need_resched_delayed() + && !hvlpevent_is_pending()) { local_irq_disable(); ppc64_runlatch_off(); /* Recheck with irqs off */ - if (!need_resched() && !hvlpevent_is_pending()) + if (!need_resched() && !need_resched_delayed() + && !hvlpevent_is_pending()) yield_shared_processor(); HMT_medium(); Index: linux/arch/powerpc/platforms/powermac/feature.c =================================================================== --- linux.orig/arch/powerpc/platforms/powermac/feature.c +++ linux/arch/powerpc/platforms/powermac/feature.c @@ -59,7 +59,7 @@ extern struct device_node *k2_skiplist[2 * We use a single global lock to protect accesses. Each driver has * to take care of its own locking */ -DEFINE_SPINLOCK(feature_lock); +DEFINE_RAW_SPINLOCK(feature_lock); #define LOCK(flags) spin_lock_irqsave(&feature_lock, flags); #define UNLOCK(flags) spin_unlock_irqrestore(&feature_lock, flags); Index: linux/arch/powerpc/platforms/powermac/nvram.c =================================================================== --- linux.orig/arch/powerpc/platforms/powermac/nvram.c +++ linux/arch/powerpc/platforms/powermac/nvram.c @@ -80,7 +80,7 @@ static int is_core_99; static int core99_bank = 0; static int nvram_partitions[3]; // XXX Turn that into a sem -static DEFINE_SPINLOCK(nv_lock); +static DEFINE_RAW_SPINLOCK(nv_lock); static int (*core99_write_bank)(int bank, u8* datas); static int (*core99_erase_bank)(int bank); Index: linux/arch/powerpc/platforms/powermac/pic.c =================================================================== --- linux.orig/arch/powerpc/platforms/powermac/pic.c +++ linux/arch/powerpc/platforms/powermac/pic.c @@ -63,7 +63,7 @@ static int max_irqs; static int max_real_irqs; static u32 level_mask[4]; -static DEFINE_SPINLOCK(pmac_pic_lock); +static DEFINE_RAW_SPINLOCK(pmac_pic_lock); #define NR_MASK_WORDS ((NR_IRQS + 31) / 32) static unsigned long ppc_lost_interrupts[NR_MASK_WORDS]; @@ -305,8 +305,6 @@ static int pmac_pic_host_map(struct irq_ level = !!(level_mask[hw >> 5] & (1UL << (hw & 0x1f))); if (level) desc->status |= IRQ_LEVEL; - else - desc->status |= IRQ_DELAYED_DISABLE; set_irq_chip_and_handler(virq, &pmac_pic, level ? handle_level_irq : handle_edge_irq); return 0; Index: linux/arch/powerpc/platforms/pseries/setup.c =================================================================== --- linux.orig/arch/powerpc/platforms/pseries/setup.c +++ linux/arch/powerpc/platforms/pseries/setup.c @@ -495,7 +495,8 @@ static void pseries_dedicated_idle_sleep set_thread_flag(TIF_POLLING_NRFLAG); while (get_tb() < start_snooze) { - if (need_resched() || cpu_is_offline(cpu)) + if (need_resched() || need_resched_delayed() || + cpu_is_offline(cpu)) goto out; ppc64_runlatch_off(); HMT_low(); @@ -506,7 +507,8 @@ static void pseries_dedicated_idle_sleep clear_thread_flag(TIF_POLLING_NRFLAG); smp_mb(); local_irq_disable(); - if (need_resched() || cpu_is_offline(cpu)) + if (need_resched() || need_resched_delayed() || + cpu_is_offline(cpu)) goto out; } Index: linux/arch/powerpc/platforms/pseries/smp.c =================================================================== --- linux.orig/arch/powerpc/platforms/pseries/smp.c +++ linux/arch/powerpc/platforms/pseries/smp.c @@ -344,7 +344,7 @@ static void __devinit smp_xics_setup_cpu } #endif /* CONFIG_XICS */ -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned long timebase = 0; static void __devinit pSeries_give_timebase(void) Index: linux/arch/ppc/8260_io/enet.c =================================================================== --- linux.orig/arch/ppc/8260_io/enet.c +++ linux/arch/ppc/8260_io/enet.c @@ -116,7 +116,7 @@ struct scc_enet_private { scc_t *sccp; struct net_device_stats stats; uint tx_full; - spinlock_t lock; + raw_spinlock_t lock; }; static int scc_enet_open(struct net_device *dev); Index: linux/arch/ppc/8260_io/fcc_enet.c =================================================================== --- linux.orig/arch/ppc/8260_io/fcc_enet.c +++ linux/arch/ppc/8260_io/fcc_enet.c @@ -376,7 +376,7 @@ struct fcc_enet_private { volatile fcc_enet_t *ep; struct net_device_stats stats; uint tx_free; - spinlock_t lock; + raw_spinlock_t lock; #ifdef CONFIG_USE_MDIO uint phy_id; Index: linux/arch/ppc/8xx_io/commproc.c =================================================================== --- linux.orig/arch/ppc/8xx_io/commproc.c +++ linux/arch/ppc/8xx_io/commproc.c @@ -355,7 +355,7 @@ cpm_setbrg(uint brg, uint rate) /* * dpalloc / dpfree bits. */ -static spinlock_t cpm_dpmem_lock; +static raw_spinlock_t cpm_dpmem_lock; /* * 16 blocks should be enough to satisfy all requests * until the memory subsystem goes up... Index: linux/arch/ppc/8xx_io/enet.c =================================================================== --- linux.orig/arch/ppc/8xx_io/enet.c +++ linux/arch/ppc/8xx_io/enet.c @@ -143,7 +143,7 @@ struct scc_enet_private { unsigned char *rx_vaddr[RX_RING_SIZE]; struct net_device_stats stats; uint tx_full; - spinlock_t lock; + raw_spinlock_t lock; }; static int scc_enet_open(struct net_device *dev); Index: linux/arch/ppc/8xx_io/fec.c =================================================================== --- linux.orig/arch/ppc/8xx_io/fec.c +++ linux/arch/ppc/8xx_io/fec.c @@ -164,7 +164,7 @@ struct fec_enet_private { struct net_device_stats stats; uint tx_full; - spinlock_t lock; + raw_spinlock_t lock; #ifdef CONFIG_USE_MDIO uint phy_id; Index: linux/arch/ppc/Kconfig =================================================================== --- linux.orig/arch/ppc/Kconfig +++ linux/arch/ppc/Kconfig @@ -12,13 +12,6 @@ config GENERIC_HARDIRQS bool default y -config RWSEM_GENERIC_SPINLOCK - bool - -config RWSEM_XCHGADD_ALGORITHM - bool - default y - config GENERIC_HWEIGHT bool default y @@ -958,6 +951,18 @@ config ARCH_POPULATES_NODE_MAP source kernel/Kconfig.hz source kernel/Kconfig.preempt + +config RWSEM_GENERIC_SPINLOCK + bool + default y + +config ASM_SEMAPHORES + bool + default y + +config RWSEM_XCHGADD_ALGORITHM + bool + source "mm/Kconfig" source "fs/Kconfig.binfmt" Index: linux/arch/ppc/boot/Makefile =================================================================== --- linux.orig/arch/ppc/boot/Makefile +++ linux/arch/ppc/boot/Makefile @@ -14,6 +14,15 @@ # CFLAGS += -fno-builtin -D__BOOTER__ -Iarch/$(ARCH)/boot/include + +ifdef CONFIG_MCOUNT +# do not trace the boot loader +nullstring := +space := $(nullstring) # end of the line +pg_flag = $(nullstring) -pg # end of the line +CFLAGS := $(subst ${pg_flag},${space},${CFLAGS}) +endif + HOSTCFLAGS += -Iarch/$(ARCH)/boot/include BOOT_TARGETS = zImage zImage.initrd znetboot znetboot.initrd Index: linux/arch/ppc/kernel/dma-mapping.c =================================================================== --- linux.orig/arch/ppc/kernel/dma-mapping.c +++ linux/arch/ppc/kernel/dma-mapping.c @@ -70,7 +70,7 @@ int map_page(unsigned long va, phys_addr * This is the page table (2MB) covering uncached, DMA consistent allocations */ static pte_t *consistent_pte; -static DEFINE_SPINLOCK(consistent_lock); +static DEFINE_RAW_SPINLOCK(consistent_lock); /* * VM region handling support. Index: linux/arch/ppc/kernel/entry.S =================================================================== --- linux.orig/arch/ppc/kernel/entry.S +++ linux/arch/ppc/kernel/entry.S @@ -856,7 +856,7 @@ load_dbcr0: #endif /* !(CONFIG_4xx || CONFIG_BOOKE) */ do_work: /* r10 contains MSR_KERNEL here */ - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beq do_user_signal do_resched: /* r10 contains MSR_KERNEL here */ @@ -870,7 +870,7 @@ recheck: MTMSRD(r10) /* disable interrupts */ rlwinm r9,r1,0,0,18 lwz r9,TI_FLAGS(r9) - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne- do_resched andi. r0,r9,_TIF_SIGPENDING beq restore_user Index: linux/arch/ppc/kernel/semaphore.c =================================================================== --- linux.orig/arch/ppc/kernel/semaphore.c +++ linux/arch/ppc/kernel/semaphore.c @@ -29,7 +29,7 @@ * sem->count = tmp; * return old_count; */ -static inline int __sem_update_count(struct semaphore *sem, int incr) +static inline int __sem_update_count(struct compat_semaphore *sem, int incr) { int old_count, tmp; @@ -48,7 +48,7 @@ static inline int __sem_update_count(str return old_count; } -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { /* * Note that we incremented count in up() before we came here, @@ -70,7 +70,7 @@ void __up(struct semaphore *sem) * Thus it is only when we decrement count from some value > 0 * that we have actually got the semaphore. */ -void __sched __down(struct semaphore *sem) +void __sched __compat_down(struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -100,7 +100,7 @@ void __sched __down(struct semaphore *se wake_up(&sem->wait); } -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -129,3 +129,8 @@ int __sched __down_interruptible(struct wake_up(&sem->wait); return retval; } + +int compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} Index: linux/arch/ppc/kernel/smp.c =================================================================== --- linux.orig/arch/ppc/kernel/smp.c +++ linux/arch/ppc/kernel/smp.c @@ -137,6 +137,16 @@ void smp_send_reschedule(int cpu) smp_message_pass(cpu, PPC_MSG_RESCHEDULE); } +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + smp_message_pass(MSG_ALL_BUT_SELF, PPC_MSG_RESCHEDULE, 0, 0); +} + #ifdef CONFIG_XMON void smp_send_xmon_break(int cpu) { @@ -161,7 +171,7 @@ void smp_send_stop(void) * static memory requirements. It also looks cleaner. * Stolen from the i386 version. */ -static DEFINE_SPINLOCK(call_lock); +static DEFINE_RAW_SPINLOCK(call_lock); static struct call_data_struct { void (*func) (void *info); Index: linux/arch/ppc/kernel/time.c =================================================================== --- linux.orig/arch/ppc/kernel/time.c +++ linux/arch/ppc/kernel/time.c @@ -66,6 +66,9 @@ #include +unsigned long cpu_khz; /* Detected as we calibrate the TSC */ +EXPORT_SYMBOL(cpu_khz); + unsigned long disarm_decr[NR_CPUS]; extern struct timezone sys_tz; @@ -102,7 +105,7 @@ static inline int tb_delta(unsigned *jif } #ifdef CONFIG_SMP -unsigned long profile_pc(struct pt_regs *regs) +unsigned long notrace profile_pc(struct pt_regs *regs) { unsigned long pc = instruction_pointer(regs); Index: linux/arch/ppc/kernel/traps.c =================================================================== --- linux.orig/arch/ppc/kernel/traps.c +++ linux/arch/ppc/kernel/traps.c @@ -71,7 +71,7 @@ void (*debugger_fault_handler)(struct pt * Trap & Exception support */ -DEFINE_SPINLOCK(die_lock); +DEFINE_RAW_SPINLOCK(die_lock); int die(const char * str, struct pt_regs * fp, long err) { @@ -106,6 +106,10 @@ void _exception(int signr, struct pt_reg debugger(regs); die("Exception in kernel mode", regs, signr); } +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif info.si_signo = signr; info.si_errno = 0; info.si_code = code; Index: linux/arch/ppc/lib/locks.c =================================================================== --- linux.orig/arch/ppc/lib/locks.c +++ linux/arch/ppc/lib/locks.c @@ -42,7 +42,7 @@ static inline unsigned long __spin_trylo return ret; } -void _raw_spin_lock(spinlock_t *lock) +void __raw_spin_lock(raw_spinlock_t *lock) { int cpu = smp_processor_id(); unsigned int stuck = INIT_STUCK; @@ -62,9 +62,9 @@ void _raw_spin_lock(spinlock_t *lock) lock->owner_pc = (unsigned long)__builtin_return_address(0); lock->owner_cpu = cpu; } -EXPORT_SYMBOL(_raw_spin_lock); +EXPORT_SYMBOL(__raw_spin_lock); -int _raw_spin_trylock(spinlock_t *lock) +int __raw_spin_trylock(raw_spinlock_t *lock) { if (__spin_trylock(&lock->lock)) return 0; @@ -72,9 +72,9 @@ int _raw_spin_trylock(spinlock_t *lock) lock->owner_pc = (unsigned long)__builtin_return_address(0); return 1; } -EXPORT_SYMBOL(_raw_spin_trylock); +EXPORT_SYMBOL(__raw_spin_trylock); -void _raw_spin_unlock(spinlock_t *lp) +void __raw_spin_unlock(raw_spinlock_t *lp) { if ( !lp->lock ) printk("_spin_unlock(%p): no lock cpu %d curr PC %p %s/%d\n", @@ -88,13 +88,13 @@ void _raw_spin_unlock(spinlock_t *lp) wmb(); lp->lock = 0; } -EXPORT_SYMBOL(_raw_spin_unlock); +EXPORT_SYMBOL(__raw_spin_unlock); /* * For rwlocks, zero is unlocked, -1 is write-locked, * positive is read-locked. */ -static __inline__ int __read_trylock(rwlock_t *rw) +static __inline__ int __read_trylock(raw_rwlock_t *rw) { signed int tmp; @@ -114,13 +114,13 @@ static __inline__ int __read_trylock(rwl return tmp; } -int _raw_read_trylock(rwlock_t *rw) +int __raw_read_trylock(raw_rwlock_t *rw) { return __read_trylock(rw) > 0; } -EXPORT_SYMBOL(_raw_read_trylock); +EXPORT_SYMBOL(__raw_read_trylock); -void _raw_read_lock(rwlock_t *rw) +void __raw_read_lock(rwlock_t *rw) { unsigned int stuck; @@ -135,9 +135,9 @@ void _raw_read_lock(rwlock_t *rw) } } } -EXPORT_SYMBOL(_raw_read_lock); +EXPORT_SYMBOL(__raw_read_lock); -void _raw_read_unlock(rwlock_t *rw) +void __raw_read_unlock(raw_rwlock_t *rw) { if ( rw->lock == 0 ) printk("_read_unlock(): %s/%d (nip %08lX) lock %d\n", @@ -146,9 +146,9 @@ void _raw_read_unlock(rwlock_t *rw) wmb(); atomic_dec((atomic_t *) &(rw)->lock); } -EXPORT_SYMBOL(_raw_read_unlock); +EXPORT_SYMBOL(__raw_read_unlock); -void _raw_write_lock(rwlock_t *rw) +void __raw_write_lock(raw_rwlock_t *rw) { unsigned int stuck; @@ -164,18 +164,18 @@ void _raw_write_lock(rwlock_t *rw) } wmb(); } -EXPORT_SYMBOL(_raw_write_lock); +EXPORT_SYMBOL(__raw_write_lock); -int _raw_write_trylock(rwlock_t *rw) +int __raw_write_trylock(raw_rwlock_t *rw) { if (cmpxchg(&rw->lock, 0, -1) != 0) return 0; wmb(); return 1; } -EXPORT_SYMBOL(_raw_write_trylock); +EXPORT_SYMBOL(__raw_write_trylock); -void _raw_write_unlock(rwlock_t *rw) +void __raw_write_unlock(raw_rwlock_t *rw) { if (rw->lock >= 0) printk("_write_lock(): %s/%d (nip %08lX) lock %d\n", @@ -184,6 +184,6 @@ void _raw_write_unlock(rwlock_t *rw) wmb(); rw->lock = 0; } -EXPORT_SYMBOL(_raw_write_unlock); +EXPORT_SYMBOL(__raw_write_unlock); #endif Index: linux/arch/ppc/mm/fault.c =================================================================== --- linux.orig/arch/ppc/mm/fault.c +++ linux/arch/ppc/mm/fault.c @@ -89,7 +89,7 @@ static int store_updates_sp(struct pt_re * the error_code parameter is ESR for a data fault, 0 for an instruction * fault. */ -int do_page_fault(struct pt_regs *regs, unsigned long address, +int notrace do_page_fault(struct pt_regs *regs, unsigned long address, unsigned long error_code) { struct vm_area_struct * vma; Index: linux/arch/ppc/mm/init.c =================================================================== --- linux.orig/arch/ppc/mm/init.c +++ linux/arch/ppc/mm/init.c @@ -55,7 +55,7 @@ #endif #define MAX_LOW_MEM CONFIG_LOWMEM_SIZE -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long total_memory; unsigned long total_lowmem; Index: linux/arch/ppc/platforms/apus_setup.c =================================================================== --- linux.orig/arch/ppc/platforms/apus_setup.c +++ linux/arch/ppc/platforms/apus_setup.c @@ -275,6 +275,7 @@ void apus_calibrate_decr(void) freq/1000000, freq%1000000); tb_ticks_per_jiffy = freq / HZ; tb_to_us = mulhwu_scale_factor(freq, 1000000); + cpu_khz = freq / 1000; __bus_speed = bus_speed; __speed_test_failed = speed_test_failed; Index: linux/arch/ppc/platforms/ev64260.c =================================================================== --- linux.orig/arch/ppc/platforms/ev64260.c +++ linux/arch/ppc/platforms/ev64260.c @@ -550,6 +550,7 @@ ev64260_calibrate_decr(void) tb_ticks_per_jiffy = freq / HZ; tb_to_us = mulhwu_scale_factor(freq, 1000000); + cpu_khz = freq / 1000; return; } Index: linux/arch/ppc/platforms/gemini_setup.c =================================================================== --- linux.orig/arch/ppc/platforms/gemini_setup.c +++ linux/arch/ppc/platforms/gemini_setup.c @@ -459,6 +459,7 @@ void __init gemini_calibrate_decr(void) divisor = 4; tb_ticks_per_jiffy = freq / HZ / divisor; tb_to_us = mulhwu_scale_factor(freq/divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; } unsigned long __init gemini_find_end_of_memory(void) Index: linux/arch/ppc/platforms/hdpu.c =================================================================== --- linux.orig/arch/ppc/platforms/hdpu.c +++ linux/arch/ppc/platforms/hdpu.c @@ -55,7 +55,7 @@ static void parse_bootinfo(unsigned long static void hdpu_set_l1pe(void); static void hdpu_cpustate_set(unsigned char new_state); #ifdef CONFIG_SMP -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned int timebase_upper = 0, timebase_lower = 0; extern int smp_tb_synchronized; Index: linux/arch/ppc/platforms/powerpmc250.c =================================================================== --- linux.orig/arch/ppc/platforms/powerpmc250.c +++ linux/arch/ppc/platforms/powerpmc250.c @@ -163,6 +163,7 @@ powerpmc250_calibrate_decr(void) tb_ticks_per_jiffy = freq / (HZ * divisor); tb_to_us = mulhwu_scale_factor(freq/divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; } static void Index: linux/arch/ppc/platforms/prep_setup.c =================================================================== --- linux.orig/arch/ppc/platforms/prep_setup.c +++ linux/arch/ppc/platforms/prep_setup.c @@ -940,6 +940,7 @@ prep_calibrate_decr(void) (freq/divisor)/1000000, (freq/divisor)%1000000); tb_to_us = mulhwu_scale_factor(freq/divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; tb_ticks_per_jiffy = freq / HZ / divisor; } } Index: linux/arch/ppc/platforms/prpmc750.c =================================================================== --- linux.orig/arch/ppc/platforms/prpmc750.c +++ linux/arch/ppc/platforms/prpmc750.c @@ -268,6 +268,7 @@ static void __init prpmc750_calibrate_de tb_ticks_per_jiffy = freq / (HZ * divisor); tb_to_us = mulhwu_scale_factor(freq / divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; } static void prpmc750_restart(char *cmd) Index: linux/arch/ppc/platforms/prpmc800.c =================================================================== --- linux.orig/arch/ppc/platforms/prpmc800.c +++ linux/arch/ppc/platforms/prpmc800.c @@ -327,6 +327,7 @@ static void __init prpmc800_calibrate_de tb_ticks_per_second = 100000000 / 4; tb_ticks_per_jiffy = tb_ticks_per_second / HZ; tb_to_us = mulhwu_scale_factor(tb_ticks_per_second, 1000000); + cpu_khz = tb_ticks_per_second / 1000; return; } @@ -367,6 +368,7 @@ static void __init prpmc800_calibrate_de tb_ticks_per_second = (tbl_end - tbl_start) * 2; tb_ticks_per_jiffy = tb_ticks_per_second / HZ; tb_to_us = mulhwu_scale_factor(tb_ticks_per_second, 1000000); + cpu_khz = tb_ticks_per_second / 1000; } static void prpmc800_restart(char *cmd) Index: linux/arch/ppc/platforms/sbc82xx.c =================================================================== --- linux.orig/arch/ppc/platforms/sbc82xx.c +++ linux/arch/ppc/platforms/sbc82xx.c @@ -65,7 +65,7 @@ static void sbc82xx_time_init(void) static volatile char *sbc82xx_i8259_map; static char sbc82xx_i8259_mask = 0xff; -static DEFINE_SPINLOCK(sbc82xx_i8259_lock); +static DEFINE_RAW_SPINLOCK(sbc82xx_i8259_lock); static void sbc82xx_i8259_mask_and_ack_irq(unsigned int irq_nr) { Index: linux/arch/ppc/platforms/spruce.c =================================================================== --- linux.orig/arch/ppc/platforms/spruce.c +++ linux/arch/ppc/platforms/spruce.c @@ -147,6 +147,7 @@ spruce_calibrate_decr(void) freq = SPRUCE_BUS_SPEED; tb_ticks_per_jiffy = freq / HZ / divisor; tb_to_us = mulhwu_scale_factor(freq/divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; } static int Index: linux/arch/ppc/syslib/cpm2_common.c =================================================================== --- linux.orig/arch/ppc/syslib/cpm2_common.c +++ linux/arch/ppc/syslib/cpm2_common.c @@ -114,7 +114,7 @@ cpm2_fastbrg(uint brg, uint rate, int di /* * dpalloc / dpfree bits. */ -static spinlock_t cpm_dpmem_lock; +static raw_spinlock_t cpm_dpmem_lock; /* 16 blocks should be enough to satisfy all requests * until the memory subsystem goes up... */ static rh_block_t cpm_boot_dpmem_rh_block[16]; Index: linux/arch/ppc/syslib/ibm44x_common.c =================================================================== --- linux.orig/arch/ppc/syslib/ibm44x_common.c +++ linux/arch/ppc/syslib/ibm44x_common.c @@ -63,6 +63,7 @@ void __init ibm44x_calibrate_decr(unsign { tb_ticks_per_jiffy = freq / HZ; tb_to_us = mulhwu_scale_factor(freq, 1000000); + cpu_khz = freq / 1000; /* Set the time base to zero */ mtspr(SPRN_TBWL, 0); Index: linux/arch/ppc/syslib/m8260_setup.c =================================================================== --- linux.orig/arch/ppc/syslib/m8260_setup.c +++ linux/arch/ppc/syslib/m8260_setup.c @@ -79,6 +79,7 @@ m8260_calibrate_decr(void) divisor = 4; tb_ticks_per_jiffy = freq / HZ / divisor; tb_to_us = mulhwu_scale_factor(freq / divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; } /* The 8260 has an internal 1-second timer update register that Index: linux/arch/ppc/syslib/m8xx_setup.c =================================================================== --- linux.orig/arch/ppc/syslib/m8xx_setup.c +++ linux/arch/ppc/syslib/m8xx_setup.c @@ -218,6 +218,7 @@ void __init m8xx_calibrate_decr(void) printk("Decrementer Frequency = %d/%d\n", freq, divisor); tb_ticks_per_jiffy = freq / HZ / divisor; tb_to_us = mulhwu_scale_factor(freq / divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; /* Perform some more timer/timebase initialization. This used * to be done elsewhere, but other changes caused it to get Index: linux/arch/ppc/syslib/mpc52xx_setup.c =================================================================== --- linux.orig/arch/ppc/syslib/mpc52xx_setup.c +++ linux/arch/ppc/syslib/mpc52xx_setup.c @@ -215,6 +215,7 @@ mpc52xx_calibrate_decr(void) tb_ticks_per_jiffy = xlbfreq / HZ / divisor; tb_to_us = mulhwu_scale_factor(xlbfreq / divisor, 1000000); + cpu_khz = (xlbfreq / divisor) / 1000; } Index: linux/arch/ppc/syslib/ocp.c =================================================================== --- linux.orig/arch/ppc/syslib/ocp.c +++ linux/arch/ppc/syslib/ocp.c @@ -44,11 +44,11 @@ #include #include #include +#include #include #include #include -#include #include //#define DBG(x) printk x Index: linux/arch/ppc/syslib/open_pic.c =================================================================== --- linux.orig/arch/ppc/syslib/open_pic.c +++ linux/arch/ppc/syslib/open_pic.c @@ -526,7 +526,7 @@ void openpic_reset_processor_phys(u_int } #if defined(CONFIG_SMP) || defined(CONFIG_PM) -static DEFINE_SPINLOCK(openpic_setup_lock); +static DEFINE_RAW_SPINLOCK(openpic_setup_lock); #endif #ifdef CONFIG_SMP Index: linux/arch/ppc/syslib/open_pic2.c =================================================================== --- linux.orig/arch/ppc/syslib/open_pic2.c +++ linux/arch/ppc/syslib/open_pic2.c @@ -380,7 +380,7 @@ static void openpic2_set_spurious(u_int vec); } -static DEFINE_SPINLOCK(openpic2_setup_lock); +static DEFINE_RAW_SPINLOCK(openpic2_setup_lock); /* * Initialize a timer interrupt (and disable it) Index: linux/arch/ppc/syslib/ppc4xx_setup.c =================================================================== --- linux.orig/arch/ppc/syslib/ppc4xx_setup.c +++ linux/arch/ppc/syslib/ppc4xx_setup.c @@ -172,6 +172,7 @@ ppc4xx_calibrate_decr(void) freq = bip->bi_tbfreq; tb_ticks_per_jiffy = freq / HZ; tb_to_us = mulhwu_scale_factor(freq, 1000000); + cpu_khz = freq / 1000; /* Set the time base to zero. ** At 200 Mhz, time base will rollover in ~2925 years. Index: linux/arch/ppc/syslib/ppc85xx_setup.c =================================================================== --- linux.orig/arch/ppc/syslib/ppc85xx_setup.c +++ linux/arch/ppc/syslib/ppc85xx_setup.c @@ -57,6 +57,7 @@ mpc85xx_calibrate_decr(void) divisor = 8; tb_ticks_per_jiffy = freq / divisor / HZ; tb_to_us = mulhwu_scale_factor(freq / divisor, 1000000); + cpu_khz = (freq / divisor) / 1000; /* Set the time base to zero */ mtspr(SPRN_TBWL, 0); Index: linux/arch/ppc/syslib/todc_time.c =================================================================== --- linux.orig/arch/ppc/syslib/todc_time.c +++ linux/arch/ppc/syslib/todc_time.c @@ -506,6 +506,7 @@ todc_calibrate_decr(void) tb_ticks_per_jiffy = freq / HZ; tb_to_us = mulhwu_scale_factor(freq, 1000000); + cpu_khz = freq / 1000; return; } Index: linux/arch/s390/appldata/appldata_base.c =================================================================== --- linux.orig/arch/s390/appldata/appldata_base.c +++ linux/arch/s390/appldata/appldata_base.c @@ -561,7 +561,6 @@ appldata_offline_cpu(int cpu) spin_unlock(&appldata_timer_lock); } -#ifdef CONFIG_HOTPLUG_CPU static int __cpuinit appldata_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu) @@ -582,7 +581,6 @@ appldata_cpu_notify(struct notifier_bloc static struct notifier_block appldata_nb = { .notifier_call = appldata_cpu_notify, }; -#endif /* * appldata_init() Index: linux/arch/sparc64/Kconfig =================================================================== --- linux.orig/arch/sparc64/Kconfig +++ linux/arch/sparc64/Kconfig @@ -26,7 +26,7 @@ config MMU bool default y -config TIME_INTERPOLATION +config GENERIC_TIME bool default y Index: linux/arch/sparc64/defconfig =================================================================== --- linux.orig/arch/sparc64/defconfig +++ linux/arch/sparc64/defconfig @@ -7,7 +7,7 @@ CONFIG_SPARC=y CONFIG_SPARC64=y CONFIG_64BIT=y CONFIG_MMU=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_AUDIT_ARCH=y CONFIG_SPARC64_PAGE_SIZE_8KB=y Index: linux/arch/sparc64/kernel/time.c =================================================================== --- linux.orig/arch/sparc64/kernel/time.c +++ linux/arch/sparc64/kernel/time.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -620,7 +621,7 @@ static void __init set_system_time(void) if (!mregs && !dregs) { prom_printf("Something wrong, clock regs not mapped yet.\n"); prom_halt(); - } + } if (mregs) { spin_lock_irq(&mostek_lock); @@ -820,7 +821,7 @@ static int __devinit clock_probe(struct } set_system_time(); - + local_irq_restore(flags); return 0; @@ -975,22 +976,33 @@ static struct notifier_block sparc64_cpu #endif /* CONFIG_CPU_FREQ */ -static struct time_interpolator sparc64_cpu_interpolator = { - .source = TIME_SOURCE_CPU, - .shift = 16, - .mask = 0xffffffffffffffffLL +static cycle_t read_itc(void) +{ + return (cycle_t)get_cycles()); +} + +static struct clocksource clocksource_sparc64_itc = { + .name = "sparc64_itc", + .rating = 300, + .read = read_itc, + .mask = 0xffffffffffffffffLL, + .mult = 0, /*to be caluclated*/ + .shift = 16, + .is_continuous = 1, }; + /* The quotient formula is taken from the IA64 port. */ #define SPARC64_NSEC_PER_CYC_SHIFT 10UL void __init time_init(void) { unsigned long clock = sparc64_init_timers(); - sparc64_cpu_interpolator.frequency = clock; - register_time_interpolator(&sparc64_cpu_interpolator); + clocksource_sparc64_itc.mult = clocksource_hz2mult(clock, + clocksource_sparc64_itc.shift); + clocksource_register(&clocksource_sparc64_itc); - /* Now that the interpolator is registered, it is + /* Now that the clocksource is registered, it is * safe to start the timer ticking. */ sparc64_start_timers(); @@ -1025,11 +1037,11 @@ static int set_rtc_mmss(unsigned long no unsigned long flags; u8 tmp; - /* + /* * Not having a register set can lead to trouble. * Also starfire doesn't have a tod clock. */ - if (!mregs && !dregs) + if (!mregs && !dregs) return -1; if (mregs) { Index: linux/arch/x86_64/Kconfig =================================================================== --- linux.orig/arch/x86_64/Kconfig +++ linux/arch/x86_64/Kconfig @@ -28,6 +28,18 @@ config ZONE_DMA32 bool default y +config GENERIC_TIME + bool + default y + +config GENERIC_CLOCKEVENTS + bool + default y + +config GENERIC_TIME_VSYSCALL + bool + default y + config LOCKDEP_SUPPORT bool default y @@ -50,13 +62,6 @@ config ISA config SBUS bool -config RWSEM_GENERIC_SPINLOCK - bool - default y - -config RWSEM_XCHGADD_ALGORITHM - bool - config GENERIC_HWEIGHT bool default y @@ -290,6 +295,8 @@ config SCHED_MC making when dealing with multi-core CPU chips at a cost of slightly increased overhead in some places. If unsure say N here. +source "kernel/time/Kconfig" + source "kernel/Kconfig.preempt" config NUMA @@ -303,6 +310,14 @@ config NUMA If the system is EM64T, you should say N unless your system is EM64T NUMA. +config RWSEM_GENERIC_SPINLOCK + bool + default y + +config RWSEM_XCHGADD_ALGORITHM + depends on !RWSEM_GENERIC_SPINLOCK && !PREEMPT_RT + bool + config K8_NUMA bool "Old style AMD Opteron NUMA detection" depends on NUMA && PCI Index: linux/arch/x86_64/ia32/ia32entry.S =================================================================== --- linux.orig/arch/x86_64/ia32/ia32entry.S +++ linux/arch/x86_64/ia32/ia32entry.S @@ -120,7 +120,9 @@ sysenter_do_call: cmpl $(IA32_NR_syscalls-1),%eax ja ia32_badsys IA32_ARG_FIXUP 1 + TRACE_SYS_IA32_CALL call *ia32_sys_call_table(,%rax,8) + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) GET_THREAD_INFO(%r10) cli @@ -229,7 +231,9 @@ cstar_do_call: cmpl $IA32_NR_syscalls-1,%eax ja ia32_badsys IA32_ARG_FIXUP 1 + TRACE_SYS_IA32_CALL call *ia32_sys_call_table(,%rax,8) + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) GET_THREAD_INFO(%r10) cli @@ -323,8 +327,10 @@ ia32_do_syscall: cmpl $(IA32_NR_syscalls-1),%eax ja ia32_badsys IA32_ARG_FIXUP + TRACE_SYS_IA32_CALL call *ia32_sys_call_table(,%rax,8) # xxx: rip relative ia32_sysret: + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) jmp int_ret_from_sys_call @@ -394,7 +400,7 @@ END(ia32_ptregs_common) .section .rodata,"a" .align 8 -ia32_sys_call_table: +ENTRY(ia32_sys_call_table) .quad sys_restart_syscall .quad sys_exit .quad stub32_fork @@ -718,4 +724,7 @@ ia32_sys_call_table: .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu +#ifdef CONFIG_EVENT_TRACE + .globl ia32_syscall_end +#endif ia32_syscall_end: Index: linux/arch/x86_64/kernel/Makefile =================================================================== --- linux.orig/arch/x86_64/kernel/Makefile +++ linux/arch/x86_64/kernel/Makefile @@ -8,7 +8,7 @@ obj-y := process.o signal.o entry.o trap ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \ x8664_ksyms.o i387.o syscall.o vsyscall.o \ setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \ - pci-dma.o pci-nommu.o alternative.o + pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-$(CONFIG_X86_MCE) += mce.o therm_throt.o @@ -19,10 +19,9 @@ obj-$(CONFIG_ACPI) += acpi/ obj-$(CONFIG_X86_MSR) += msr.o obj-$(CONFIG_MICROCODE) += microcode.o obj-$(CONFIG_X86_CPUID) += cpuid.o -obj-$(CONFIG_SMP) += smp.o smpboot.o trampoline.o +obj-$(CONFIG_SMP) += smp.o smpboot.o trampoline.o tsc_sync.o obj-y += apic.o nmi.o -obj-y += io_apic.o mpparse.o \ - genapic.o genapic_cluster.o genapic_flat.o +obj-y += io_apic.o mpparse.o genapic.o genapic_flat.o obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o crash.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_PM) += suspend.o Index: linux/arch/x86_64/kernel/apic.c =================================================================== --- linux.orig/arch/x86_64/kernel/apic.c +++ linux/arch/x86_64/kernel/apic.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -37,10 +38,10 @@ #include #include #include +#include int apic_mapped; int apic_verbosity; -int apic_runs_main_timer; int apic_calibrate_pmtmr __initdata; int disable_apic_timer __initdata; @@ -54,6 +55,27 @@ static cpumask_t timer_interrupt_broadca /* Using APIC to generate smp_local_timer_interrupt? */ int using_apic_timer __read_mostly = 0; + +static unsigned int calibration_result; + +static void lapic_next_event(unsigned long delta, + struct clock_event_device *evt); +static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt); + +static struct clock_event_device lapic_clockevent = { + .name = "lapic", + .capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE +#ifdef CONFIG_SMP + | CLOCK_CAP_UPDATE +#endif + , + .shift = 32, + .set_mode = lapic_timer_setup, + .set_next_event = lapic_next_event, +}; +static DEFINE_PER_CPU(struct clock_event_device, lapic_events); + static void apic_pm_activate(void); void enable_NMI_through_LVT0 (void * dummy) @@ -452,23 +474,31 @@ static struct { static int lapic_suspend(struct sys_device *dev, pm_message_t state) { unsigned long flags; + int maxlvt; if (!apic_pm_state.active) return 0; + maxlvt = get_maxlvt(); + apic_pm_state.apic_id = apic_read(APIC_ID); apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI); apic_pm_state.apic_ldr = apic_read(APIC_LDR); apic_pm_state.apic_dfr = apic_read(APIC_DFR); apic_pm_state.apic_spiv = apic_read(APIC_SPIV); apic_pm_state.apic_lvtt = apic_read(APIC_LVTT); - apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC); + if (maxlvt >= 4) + apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC); apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); apic_pm_state.apic_tmict = apic_read(APIC_TMICT); apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); - apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); +#ifdef CONFIG_X86_MCE_P4THERMAL + if (maxlvt >= 5) + apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); +#endif + local_irq_save(flags); disable_local_APIC(); local_irq_restore(flags); @@ -479,15 +509,20 @@ static int lapic_resume(struct sys_devic { unsigned int l, h; unsigned long flags; + int maxlvt; if (!apic_pm_state.active) return 0; + maxlvt = get_maxlvt(); + local_irq_save(flags); + rdmsr(MSR_IA32_APICBASE, l, h); l &= ~MSR_IA32_APICBASE_BASE; l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr; wrmsr(MSR_IA32_APICBASE, l, h); + apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED); apic_write(APIC_ID, apic_pm_state.apic_id); apic_write(APIC_DFR, apic_pm_state.apic_dfr); @@ -496,8 +531,12 @@ static int lapic_resume(struct sys_devic apic_write(APIC_SPIV, apic_pm_state.apic_spiv); apic_write(APIC_LVT0, apic_pm_state.apic_lvt0); apic_write(APIC_LVT1, apic_pm_state.apic_lvt1); - apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr); - apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); +#ifdef CONFIG_X86_MCE_P4THERMAL + if (maxlvt >= 5) + apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr); +#endif + if (maxlvt >= 4) + apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); apic_write(APIC_TMICT, apic_pm_state.apic_tmict); @@ -642,13 +681,16 @@ void __init init_apic_mappings(void) #define APIC_DIVISOR 16 -static void __setup_APIC_LVTT(unsigned int clocks) +static void __setup_APIC_LVTT(unsigned int clocks, int oneshot) { unsigned int lvtt_value, tmp_value, ver; int cpu = smp_processor_id(); ver = GET_APIC_VERSION(apic_read(APIC_LVR)); - lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR; + lvtt_value = LOCAL_TIMER_VECTOR; + if (!oneshot) + lvtt_value |= APIC_LVT_TIMER_PERIODIC; + if (cpu_isset(cpu, timer_interrupt_broadcast_ipi_mask)) lvtt_value |= APIC_LVT_MASKED; @@ -663,48 +705,36 @@ static void __setup_APIC_LVTT(unsigned i & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) | APIC_TDR_DIV_16); - apic_write(APIC_TMICT, clocks/APIC_DIVISOR); + if (!oneshot) + apic_write(APIC_TMICT, clocks/APIC_DIVISOR); } -static void setup_APIC_timer(unsigned int clocks) +static void lapic_next_event(unsigned long delta, + struct clock_event_device *evt) +{ + apic_write(APIC_TMICT, delta); +} + +static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt) { unsigned long flags; local_irq_save(flags); - - /* wait for irq slice */ - if (vxtime.hpet_address && hpet_use_timer) { - int trigger = hpet_readl(HPET_T0_CMP); - while (hpet_readl(HPET_COUNTER) >= trigger) - /* do nothing */ ; - while (hpet_readl(HPET_COUNTER) < trigger) - /* do nothing */ ; - } else { - int c1, c2; - outb_p(0x00, 0x43); - c2 = inb_p(0x40); - c2 |= inb_p(0x40) << 8; - do { - c1 = c2; - outb_p(0x00, 0x43); - c2 = inb_p(0x40); - c2 |= inb_p(0x40) << 8; - } while (c2 - c1 < 300); - } - __setup_APIC_LVTT(clocks); - /* Turn off PIT interrupt if we use APIC timer as main timer. - Only works with the PM timer right now - TBD fix it for HPET too. */ - if (vxtime.mode == VXTIME_PMTMR && - smp_processor_id() == boot_cpu_id && - apic_runs_main_timer == 1 && - !cpu_isset(boot_cpu_id, timer_interrupt_broadcast_ipi_mask)) { - stop_timer_interrupt(); - apic_runs_main_timer++; - } + __setup_APIC_LVTT(calibration_result, mode != CLOCK_EVT_PERIODIC); local_irq_restore(flags); } + +static void __devinit setup_APIC_timer(void) +{ + struct clock_event_device *levt = &__get_cpu_var(lapic_events); + + memcpy(levt, &lapic_clockevent, sizeof(*levt)); + + register_local_clockevent(levt); +} + /* * In this function we calibrate APIC bus clocks to the external * timer. Unfortunately we cannot use jiffies and the timer irq @@ -724,12 +754,13 @@ static int __init calibrate_APIC_clock(v { int apic, apic_start, tsc, tsc_start; int result; + u64 wallclock_nsecs; /* * Put whatever arbitrary (but long enough) timeout * value into the APIC clock, we just want to get the * counter running for calibration. */ - __setup_APIC_LVTT(1000000000); + __setup_APIC_LVTT(1000000000, 0); apic_start = apic_read(APIC_TMCCT); #ifdef CONFIG_X86_PM_TIMER @@ -737,6 +768,8 @@ static int __init calibrate_APIC_clock(v pmtimer_wait(5000); /* 5ms wait */ apic = apic_read(APIC_TMCCT); result = (apic_start - apic) * 1000L / 5; + printk("using pmtimer for lapic calibration\n"); + wallclock_nsecs = 5000000; } else #endif { @@ -750,6 +783,8 @@ static int __init calibrate_APIC_clock(v result = (apic_start - apic) * 1000L * cpu_khz / (tsc - tsc_start); + wallclock_nsecs = ((u64)tsc - (u64)tsc_start) * 1000000 / (u64)cpu_khz; + } printk("result %d\n", result); @@ -757,11 +792,22 @@ static int __init calibrate_APIC_clock(v printk(KERN_INFO "Detected %d.%03d MHz APIC timer.\n", result / 1000 / 1000, result / 1000 % 1000); + + + + /* Calculate the scaled math multiplication factor */ + lapic_clockevent.mult = div_sc(apic_start - apic, wallclock_nsecs, 32); + + lapic_clockevent.max_delta_ns = + clockevent_delta2ns(0x7FFFFF, &lapic_clockevent); + printk("lapic max_delta_ns: %ld\n", lapic_clockevent.max_delta_ns); + lapic_clockevent.min_delta_ns = + clockevent_delta2ns(0xF, &lapic_clockevent); + + return result * APIC_DIVISOR / HZ; } -static unsigned int calibration_result; - void __init setup_boot_APIC_clock (void) { if (disable_apic_timer) { @@ -778,7 +824,7 @@ void __init setup_boot_APIC_clock (void) /* * Now set up the timer for real. */ - setup_APIC_timer(calibration_result); + setup_APIC_timer(); local_irq_enable(); } @@ -786,7 +832,7 @@ void __init setup_boot_APIC_clock (void) void __cpuinit setup_secondary_APIC_clock(void) { local_irq_disable(); /* FIXME: Do we need this? --RR */ - setup_APIC_timer(calibration_result); + setup_APIC_timer(); local_irq_enable(); } @@ -833,6 +879,13 @@ void switch_APIC_timer_to_ipi(void *cpum !cpu_isset(cpu, timer_interrupt_broadcast_ipi_mask)) { disable_APIC_timer(); cpu_set(cpu, timer_interrupt_broadcast_ipi_mask); +#ifdef CONFIG_HIGH_RES_TIMERS + printk("Disabling NO_HZ and high resolution timers " + "due to timer broadcasting\n"); + for_each_possible_cpu(cpu) + per_cpu(lapic_events, cpu).capabilities &= + ~CLOCK_CAP_NEXTEVT; +#endif } } EXPORT_SYMBOL(switch_APIC_timer_to_ipi); @@ -887,12 +940,10 @@ void setup_APIC_extened_lvt(unsigned cha void smp_local_timer_interrupt(void) { - profile_tick(CPU_PROFILING); +// profile_tick(CPU_PROFILING); #ifdef CONFIG_SMP update_process_times(user_mode(get_irq_regs())); #endif - if (apic_runs_main_timer > 1 && smp_processor_id() == boot_cpu_id) - main_timer_handler(); /* * We take the 'long' return path, and there every subsystem * grabs the appropriate locks (kernel lock/ irq lock). @@ -917,6 +968,8 @@ void smp_apic_timer_interrupt(struct pt_ { struct pt_regs *old_regs = set_irq_regs(regs); + int cpu = smp_processor_id(); + struct clock_event_device *evt = &per_cpu(lapic_events, cpu); /* * the NMI deadlock-detector uses this. */ @@ -934,7 +987,7 @@ void smp_apic_timer_interrupt(struct pt_ */ exit_idle(); irq_enter(); - smp_local_timer_interrupt(); + evt->event_handler(regs); irq_exit(); set_irq_regs(old_regs); } @@ -1109,26 +1162,11 @@ static __init int setup_noapictimer(char return 1; } -static __init int setup_apicmaintimer(char *str) -{ - apic_runs_main_timer = 1; - nohpet = 1; - return 1; -} -__setup("apicmaintimer", setup_apicmaintimer); - -static __init int setup_noapicmaintimer(char *str) -{ - apic_runs_main_timer = -1; - return 1; -} -__setup("noapicmaintimer", setup_noapicmaintimer); - static __init int setup_apicpmtimer(char *s) { apic_calibrate_pmtmr = 1; notsc_setup(NULL); - return setup_apicmaintimer(NULL); + return 1; } __setup("apicpmtimer", setup_apicpmtimer); Index: linux/arch/x86_64/kernel/early_printk.c =================================================================== --- linux.orig/arch/x86_64/kernel/early_printk.c +++ linux/arch/x86_64/kernel/early_printk.c @@ -203,7 +203,7 @@ static int early_console_initialized = 0 void early_printk(const char *fmt, ...) { - char buf[512]; + static char buf[512]; int n; va_list ap; Index: linux/arch/x86_64/kernel/entry.S =================================================================== --- linux.orig/arch/x86_64/kernel/entry.S +++ linux/arch/x86_64/kernel/entry.S @@ -53,6 +53,47 @@ .code64 +#ifdef CONFIG_EVENT_TRACE + +ENTRY(mcount) + cmpl $0, mcount_enabled + jz out + + push %rbp + mov %rsp,%rbp + + push %r11 + push %r10 + push %r9 + push %r8 + push %rdi + push %rsi + push %rdx + push %rcx + push %rax + + mov 0x0(%rbp),%rax + mov 0x8(%rbp),%rdi + mov 0x8(%rax),%rsi + + call __trace + + pop %rax + pop %rcx + pop %rdx + pop %rsi + pop %rdi + pop %r8 + pop %r9 + pop %r10 + pop %r11 + + pop %rbp +out: + ret + +#endif + #ifndef CONFIG_PREEMPT #define retint_kernel retint_restore_args #endif @@ -235,7 +276,9 @@ ENTRY(system_call) cmpq $__NR_syscall_max,%rax ja badsys movq %r10,%rcx + TRACE_SYS_CALL call *sys_call_table(,%rax,8) # XXX: rip relative + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) /* * Syscall return path ending with SYSRET (fast path) @@ -269,8 +312,8 @@ sysret_check: /* edx: work, edi: workmask */ sysret_careful: CFI_RESTORE_STATE - bt $TIF_NEED_RESCHED,%edx - jnc sysret_signal + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edx + jz sysret_signal TRACE_IRQS_ON sti pushq %rdi @@ -293,7 +336,7 @@ sysret_signal: leaq -ARGOFFSET(%rsp),%rdi # &pt_regs -> arg1 xorl %esi,%esi # oldset -> arg2 call ptregscall_common -1: movl $_TIF_NEED_RESCHED,%edi +1: movl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edi /* Use IRET because user could have changed frame. This works because ptregscall_common has called FIXUP_TOP_OF_STACK. */ cli @@ -319,7 +362,9 @@ tracesys: cmova %rcx,%rax ja 1f movq %r10,%rcx /* fixup for C */ + TRACE_SYS_CALL call *sys_call_table(,%rax,8) + TRACE_SYS_RET 1: movq %rax,RAX-ARGOFFSET(%rsp) /* Use IRET because user could have changed frame */ jmp int_ret_from_sys_call @@ -366,8 +411,8 @@ int_with_check: /* First do a reschedule test. */ /* edx: work, edi: workmask */ int_careful: - bt $TIF_NEED_RESCHED,%edx - jnc int_very_careful + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edx + jz int_very_careful TRACE_IRQS_ON sti pushq %rdi @@ -404,7 +449,7 @@ int_signal: movq %rsp,%rdi # &ptregs -> arg1 xorl %esi,%esi # oldset -> arg2 call do_notify_resume -1: movl $_TIF_NEED_RESCHED,%edi +1: movl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edi int_restore_rest: RESTORE_REST cli @@ -608,8 +653,8 @@ bad_iret: /* edi: workmask, edx: work */ retint_careful: CFI_RESTORE_STATE - bt $TIF_NEED_RESCHED,%edx - jnc retint_signal + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edx + jz retint_signal TRACE_IRQS_ON sti pushq %rdi @@ -635,7 +680,7 @@ retint_signal: RESTORE_REST cli TRACE_IRQS_OFF - movl $_TIF_NEED_RESCHED,%edi + movl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edi GET_THREAD_INFO(%rcx) jmp retint_check Index: linux/arch/x86_64/kernel/genapic.c =================================================================== --- linux.orig/arch/x86_64/kernel/genapic.c +++ linux/arch/x86_64/kernel/genapic.c @@ -11,113 +11,41 @@ #include #include #include +#include #include #include #include -#include #include #include -#if defined(CONFIG_ACPI) +#ifdef CONFIG_ACPI #include #endif /* which logical CPU number maps to which CPU (physical APIC ID) */ -u8 x86_cpu_to_apicid[NR_CPUS] __read_mostly = { [0 ... NR_CPUS-1] = BAD_APICID }; +u8 x86_cpu_to_apicid[NR_CPUS] __read_mostly + = { [0 ... NR_CPUS-1] = BAD_APICID }; EXPORT_SYMBOL(x86_cpu_to_apicid); -u8 x86_cpu_to_log_apicid[NR_CPUS] = { [0 ... NR_CPUS-1] = BAD_APICID }; -extern struct genapic apic_cluster; -extern struct genapic apic_flat; -extern struct genapic apic_physflat; - -struct genapic *genapic = &apic_flat; +u8 x86_cpu_to_log_apicid[NR_CPUS] = { [0 ... NR_CPUS-1] = BAD_APICID }; +struct genapic __read_mostly *genapic = &apic_flat; /* - * Check the APIC IDs in bios_cpu_apicid and choose the APIC mode. + * Choose the APIC routing mode: */ -void __init clustered_apic_check(void) +void __init setup_apic_routing(void) { - long i; - u8 clusters, max_cluster; - u8 id; - u8 cluster_cnt[NUM_APIC_CLUSTERS]; - int max_apic = 0; - -#if defined(CONFIG_ACPI) - /* - * Some x86_64 machines use physical APIC mode regardless of how many - * procs/clusters are present (x86_64 ES7000 is an example). - */ - if (acpi_fadt.revision > FADT2_REVISION_ID) - if (acpi_fadt.force_apic_physical_destination_mode) { - genapic = &apic_cluster; - goto print; - } -#endif - - memset(cluster_cnt, 0, sizeof(cluster_cnt)); - for (i = 0; i < NR_CPUS; i++) { - id = bios_cpu_apicid[i]; - if (id == BAD_APICID) - continue; - if (id > max_apic) - max_apic = id; - cluster_cnt[APIC_CLUSTERID(id)]++; - } - - /* Don't use clustered mode on AMD platforms. */ - if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) { - genapic = &apic_physflat; -#ifndef CONFIG_HOTPLUG_CPU - /* In the CPU hotplug case we cannot use broadcast mode - because that opens a race when a CPU is removed. - Stay at physflat mode in this case. - It is bad to do this unconditionally though. Once - we have ACPI platform support for CPU hotplug - we should detect hotplug capablity from ACPI tables and - only do this when really needed. -AK */ - if (max_apic <= 8) - genapic = &apic_flat; -#endif - goto print; - } - - clusters = 0; - max_cluster = 0; - - for (i = 0; i < NUM_APIC_CLUSTERS; i++) { - if (cluster_cnt[i] > 0) { - ++clusters; - if (cluster_cnt[i] > max_cluster) - max_cluster = cluster_cnt[i]; - } - } - - /* - * If we have clusters <= 1 and CPUs <= 8 in cluster 0, then flat mode, - * else if max_cluster <= 4 and cluster_cnt[15] == 0, clustered logical - * else physical mode. - * (We don't use lowest priority delivery + HW APIC IRQ steering, so - * can ignore the clustered logical case and go straight to physical.) - */ - if (clusters <= 1 && max_cluster <= 8 && cluster_cnt[0] == max_cluster) { -#ifdef CONFIG_HOTPLUG_CPU - /* Don't use APIC shortcuts in CPU hotplug to avoid races */ - genapic = &apic_physflat; -#else + if (cpus_weight(cpu_possible_map) <= 8) genapic = &apic_flat; -#endif - } else - genapic = &apic_cluster; + else + genapic = &apic_physflat; -print: printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); } -/* Same for both flat and clustered. */ +/* Same for both flat and physical. */ void send_IPI_self(int vector) { Index: linux/arch/x86_64/kernel/genapic_cluster.c =================================================================== --- linux.orig/arch/x86_64/kernel/genapic_cluster.c +++ /dev/null @@ -1,137 +0,0 @@ -/* - * Copyright 2004 James Cleverdon, IBM. - * Subject to the GNU Public License, v.2 - * - * Clustered APIC subarch code. Up to 255 CPUs, physical delivery. - * (A more realistic maximum is around 230 CPUs.) - * - * Hacked for x86-64 by James Cleverdon from i386 architecture code by - * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and - * James Cleverdon. - */ -#include -#include -#include -#include -#include -#include -#include -#include - - -/* - * Set up the logical destination ID. - * - * Intel recommends to set DFR, LDR and TPR before enabling - * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel - * document number 292116). So here it goes... - */ -static void cluster_init_apic_ldr(void) -{ - unsigned long val, id; - long i, count; - u8 lid; - u8 my_id = hard_smp_processor_id(); - u8 my_cluster = APIC_CLUSTER(my_id); - - /* Create logical APIC IDs by counting CPUs already in cluster. */ - for (count = 0, i = NR_CPUS; --i >= 0; ) { - lid = x86_cpu_to_log_apicid[i]; - if (lid != BAD_APICID && APIC_CLUSTER(lid) == my_cluster) - ++count; - } - /* - * We only have a 4 wide bitmap in cluster mode. There's no way - * to get above 60 CPUs and still give each one it's own bit. - * But, we're using physical IRQ delivery, so we don't care. - * Use bit 3 for the 4th through Nth CPU in each cluster. - */ - if (count >= XAPIC_DEST_CPUS_SHIFT) - count = 3; - id = my_cluster | (1UL << count); - x86_cpu_to_log_apicid[smp_processor_id()] = id; - apic_write(APIC_DFR, APIC_DFR_CLUSTER); - val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; - val |= SET_APIC_LOGICAL_ID(id); - apic_write(APIC_LDR, val); -} - -/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ - -static cpumask_t cluster_target_cpus(void) -{ - return cpumask_of_cpu(0); -} - -static cpumask_t cluster_vector_allocation_domain(int cpu) -{ - cpumask_t domain = CPU_MASK_NONE; - cpu_set(cpu, domain); - return domain; -} - -static void cluster_send_IPI_mask(cpumask_t mask, int vector) -{ - send_IPI_mask_sequence(mask, vector); -} - -static void cluster_send_IPI_allbutself(int vector) -{ - cpumask_t mask = cpu_online_map; - - cpu_clear(smp_processor_id(), mask); - - if (!cpus_empty(mask)) - cluster_send_IPI_mask(mask, vector); -} - -static void cluster_send_IPI_all(int vector) -{ - cluster_send_IPI_mask(cpu_online_map, vector); -} - -static int cluster_apic_id_registered(void) -{ - return 1; -} - -static unsigned int cluster_cpu_mask_to_apicid(cpumask_t cpumask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - cpu = first_cpu(cpumask); - if ((unsigned)cpu < NR_CPUS) - return x86_cpu_to_apicid[cpu]; - else - return BAD_APICID; -} - -/* cpuid returns the value latched in the HW at reset, not the APIC ID - * register's value. For any box whose BIOS changes APIC IDs, like - * clustered APIC systems, we must use hard_smp_processor_id. - * - * See Intel's IA-32 SW Dev's Manual Vol2 under CPUID. - */ -static unsigned int phys_pkg_id(int index_msb) -{ - return hard_smp_processor_id() >> index_msb; -} - -struct genapic apic_cluster = { - .name = "clustered", - .int_delivery_mode = dest_Fixed, - .int_dest_mode = (APIC_DEST_PHYSICAL != 0), - .target_cpus = cluster_target_cpus, - .vector_allocation_domain = cluster_vector_allocation_domain, - .apic_id_registered = cluster_apic_id_registered, - .init_apic_ldr = cluster_init_apic_ldr, - .send_IPI_all = cluster_send_IPI_all, - .send_IPI_allbutself = cluster_send_IPI_allbutself, - .send_IPI_mask = cluster_send_IPI_mask, - .cpu_mask_to_apicid = cluster_cpu_mask_to_apicid, - .phys_pkg_id = phys_pkg_id, -}; Index: linux/arch/x86_64/kernel/head64.c =================================================================== --- linux.orig/arch/x86_64/kernel/head64.c +++ linux/arch/x86_64/kernel/head64.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -53,7 +54,7 @@ static void __init copy_bootdata(char *r memcpy(saved_command_line, command_line, COMMAND_LINE_SIZE); } -void __init x86_64_start_kernel(char * real_mode_data) +void __init notrace x86_64_start_kernel(char * real_mode_data) { int i; Index: linux/arch/x86_64/kernel/hpet.c =================================================================== --- /dev/null +++ linux/arch/x86_64/kernel/hpet.c @@ -0,0 +1,478 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * TODO: keep hpet disabled for now + */ +int nohpet __initdata = 1; + +unsigned long hpet_address; +static unsigned long hpet_period; /* fsecs / HPET clock */ +unsigned long hpet_tick; /* HPET clocks / interrupt */ +int hpet_use_timer; /* Use counter of hpet for time keeping, otherwise PIT */ + +#define FSEC_PER_TICK (FSEC_PER_SEC / HZ) + +/* + * calibrate_tsc() calibrates the processor TSC in a very simple way, comparing + * it to the HPET timer of known frequency. + */ + +#define TICK_COUNT 100000000 + +unsigned int __init hpet_calibrate_tsc(void) +{ + int tsc_start, hpet_start; + int tsc_now, hpet_now; + unsigned long flags; + + local_irq_save(flags); + local_irq_disable(); + + hpet_start = hpet_readl(HPET_COUNTER); + rdtscl(tsc_start); + + do { + local_irq_disable(); + hpet_now = hpet_readl(HPET_COUNTER); + tsc_now = get_cycles_sync(); + local_irq_restore(flags); + } while ((tsc_now - tsc_start) < TICK_COUNT && + (hpet_now - hpet_start) < TICK_COUNT); + + return (tsc_now - tsc_start) * 1000000000L + / ((hpet_now - hpet_start) * hpet_period / 1000); +} + + + +#ifdef CONFIG_HPET +static __init int late_hpet_init(void) +{ + struct hpet_data hd; + unsigned int ntimer; + + if (!hpet_address) + return 0; + + memset(&hd, 0, sizeof (hd)); + + ntimer = hpet_readl(HPET_ID); + ntimer = (ntimer & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT; + ntimer++; + + /* + * Register with driver. + * Timer0 and Timer1 is used by platform. + */ + hd.hd_phys_address = hpet_address; + hd.hd_address = (void __iomem *)fix_to_virt(FIX_HPET_BASE); + hd.hd_nirqs = ntimer; + hd.hd_flags = HPET_DATA_PLATFORM; + hpet_reserve_timer(&hd, 0); +#ifdef CONFIG_HPET_EMULATE_RTC + hpet_reserve_timer(&hd, 1); +#endif + hd.hd_irq[0] = HPET_LEGACY_8254; + hd.hd_irq[1] = HPET_LEGACY_RTC; + if (ntimer > 2) { + struct hpet *hpet; + struct hpet_timer *timer; + int i; + + hpet = (struct hpet *) fix_to_virt(FIX_HPET_BASE); + timer = &hpet->hpet_timers[2]; + for (i = 2; i < ntimer; timer++, i++) + hd.hd_irq[i] = (timer->hpet_config & + Tn_INT_ROUTE_CNF_MASK) >> + Tn_INT_ROUTE_CNF_SHIFT; + + } + + hpet_alloc(&hd); + return 0; +} +fs_initcall(late_hpet_init); +#endif + +static int hpet_timer_stop_set_go(unsigned long tick) +{ + unsigned int cfg; + +/* + * Stop the timers and reset the main counter. + */ + + cfg = hpet_readl(HPET_CFG); + cfg &= ~(HPET_CFG_ENABLE | HPET_CFG_LEGACY); + hpet_writel(cfg, HPET_CFG); + hpet_writel(0, HPET_COUNTER); + hpet_writel(0, HPET_COUNTER + 4); + +/* + * Set up timer 0, as periodic with first interrupt to happen at hpet_tick, + * and period also hpet_tick. + */ + if (hpet_use_timer) { + hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | + HPET_TN_32BIT, HPET_T0_CFG); + hpet_writel(hpet_tick, HPET_T0_CMP); /* next interrupt */ + hpet_writel(hpet_tick, HPET_T0_CMP); /* period */ + cfg |= HPET_CFG_LEGACY; + } +/* + * Go! + */ + + cfg |= HPET_CFG_ENABLE; + hpet_writel(cfg, HPET_CFG); + + return 0; +} + +int hpet_arch_init(void) +{ + unsigned int id; + + if (!hpet_address) + return -1; + set_fixmap_nocache(FIX_HPET_BASE, hpet_address); + __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE); + +/* + * Read the period, compute tick and quotient. + */ + + id = hpet_readl(HPET_ID); + + if (!(id & HPET_ID_VENDOR) || !(id & HPET_ID_NUMBER)) + return -1; + + hpet_period = hpet_readl(HPET_PERIOD); + if (hpet_period < 100000 || hpet_period > 100000000) + return -1; + + hpet_tick = (FSEC_PER_TICK + hpet_period / 2) / hpet_period; + + hpet_use_timer = (id & HPET_ID_LEGSUP); + + return hpet_timer_stop_set_go(hpet_tick); +} + +int hpet_reenable(void) +{ + return hpet_timer_stop_set_go(hpet_tick); +} + +int hpet_stop(void) +{ + return hpet_timer_stop_set_go(0); +} + +#ifdef CONFIG_HPET_EMULATE_RTC +/* HPET in LegacyReplacement Mode eats up RTC interrupt line. When, HPET + * is enabled, we support RTC interrupt functionality in software. + * RTC has 3 kinds of interrupts: + * 1) Update Interrupt - generate an interrupt, every sec, when RTC clock + * is updated + * 2) Alarm Interrupt - generate an interrupt at a specific time of day + * 3) Periodic Interrupt - generate periodic interrupt, with frequencies + * 2Hz-8192Hz (2Hz-64Hz for non-root user) (all freqs in powers of 2) + * (1) and (2) above are implemented using polling at a frequency of + * 64 Hz. The exact frequency is a tradeoff between accuracy and interrupt + * overhead. (DEFAULT_RTC_INT_FREQ) + * For (3), we use interrupts at 64Hz or user specified periodic + * frequency, whichever is higher. + */ +#include + +#define DEFAULT_RTC_INT_FREQ 64 +#define RTC_NUM_INTS 1 + +static unsigned long UIE_on; +static unsigned long prev_update_sec; + +static unsigned long AIE_on; +static struct rtc_time alarm_time; + +static unsigned long PIE_on; +static unsigned long PIE_freq = DEFAULT_RTC_INT_FREQ; +static unsigned long PIE_count; + +static unsigned long hpet_rtc_int_freq; /* RTC interrupt frequency */ +static unsigned int hpet_t1_cmp; /* cached comparator register */ + +int is_hpet_enabled(void) +{ + return hpet_address != 0; +} + +/* + * Timer 1 for RTC, we do not use periodic interrupt feature, + * even if HPET supports periodic interrupts on Timer 1. + * The reason being, to set up a periodic interrupt in HPET, we need to + * stop the main counter. And if we do that everytime someone diables/enables + * RTC, we will have adverse effect on main kernel timer running on Timer 0. + * So, for the time being, simulate the periodic interrupt in software. + * + * hpet_rtc_timer_init() is called for the first time and during subsequent + * interuppts reinit happens through hpet_rtc_timer_reinit(). + */ +int hpet_rtc_timer_init(void) +{ + unsigned int cfg, cnt; + unsigned long flags; + + if (!is_hpet_enabled()) + return 0; + /* + * Set the counter 1 and enable the interrupts. + */ + if (PIE_on && (PIE_freq > DEFAULT_RTC_INT_FREQ)) + hpet_rtc_int_freq = PIE_freq; + else + hpet_rtc_int_freq = DEFAULT_RTC_INT_FREQ; + + local_irq_save(flags); + cnt = hpet_readl(HPET_COUNTER); + cnt += ((hpet_tick*HZ)/hpet_rtc_int_freq); + hpet_writel(cnt, HPET_T1_CMP); + hpet_t1_cmp = cnt; + local_irq_restore(flags); + + cfg = hpet_readl(HPET_T1_CFG); + cfg &= ~HPET_TN_PERIODIC; + cfg |= HPET_TN_ENABLE | HPET_TN_32BIT; + hpet_writel(cfg, HPET_T1_CFG); + + return 1; +} + +static void hpet_rtc_timer_reinit(void) +{ + unsigned int cfg, cnt; + + if (unlikely(!(PIE_on | AIE_on | UIE_on))) { + cfg = hpet_readl(HPET_T1_CFG); + cfg &= ~HPET_TN_ENABLE; + hpet_writel(cfg, HPET_T1_CFG); + return; + } + + if (PIE_on && (PIE_freq > DEFAULT_RTC_INT_FREQ)) + hpet_rtc_int_freq = PIE_freq; + else + hpet_rtc_int_freq = DEFAULT_RTC_INT_FREQ; + + /* It is more accurate to use the comparator value than current count.*/ + cnt = hpet_t1_cmp; + cnt += hpet_tick*HZ/hpet_rtc_int_freq; + hpet_writel(cnt, HPET_T1_CMP); + hpet_t1_cmp = cnt; +} + +/* + * The functions below are called from rtc driver. + * Return 0 if HPET is not being used. + * Otherwise do the necessary changes and return 1. + */ +int hpet_mask_rtc_irq_bit(unsigned long bit_mask) +{ + if (!is_hpet_enabled()) + return 0; + + if (bit_mask & RTC_UIE) + UIE_on = 0; + if (bit_mask & RTC_PIE) + PIE_on = 0; + if (bit_mask & RTC_AIE) + AIE_on = 0; + + return 1; +} + +int hpet_set_rtc_irq_bit(unsigned long bit_mask) +{ + int timer_init_reqd = 0; + + if (!is_hpet_enabled()) + return 0; + + if (!(PIE_on | AIE_on | UIE_on)) + timer_init_reqd = 1; + + if (bit_mask & RTC_UIE) { + UIE_on = 1; + } + if (bit_mask & RTC_PIE) { + PIE_on = 1; + PIE_count = 0; + } + if (bit_mask & RTC_AIE) { + AIE_on = 1; + } + + if (timer_init_reqd) + hpet_rtc_timer_init(); + + return 1; +} + +int hpet_set_alarm_time(unsigned char hrs, unsigned char min, unsigned char sec) +{ + if (!is_hpet_enabled()) + return 0; + + alarm_time.tm_hour = hrs; + alarm_time.tm_min = min; + alarm_time.tm_sec = sec; + + return 1; +} + +int hpet_set_periodic_freq(unsigned long freq) +{ + if (!is_hpet_enabled()) + return 0; + + PIE_freq = freq; + PIE_count = 0; + + return 1; +} + +int hpet_rtc_dropped_irq(void) +{ + if (!is_hpet_enabled()) + return 0; + + return 1; +} + +irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id) +{ + struct rtc_time curr_time; + unsigned long rtc_int_flag = 0; + int call_rtc_interrupt = 0; + + hpet_rtc_timer_reinit(); + + if (UIE_on | AIE_on) { + rtc_get_rtc_time(&curr_time); + } + if (UIE_on) { + if (curr_time.tm_sec != prev_update_sec) { + /* Set update int info, call real rtc int routine */ + call_rtc_interrupt = 1; + rtc_int_flag = RTC_UF; + prev_update_sec = curr_time.tm_sec; + } + } + if (PIE_on) { + PIE_count++; + if (PIE_count >= hpet_rtc_int_freq/PIE_freq) { + /* Set periodic int info, call real rtc int routine */ + call_rtc_interrupt = 1; + rtc_int_flag |= RTC_PF; + PIE_count = 0; + } + } + if (AIE_on) { + if ((curr_time.tm_sec == alarm_time.tm_sec) && + (curr_time.tm_min == alarm_time.tm_min) && + (curr_time.tm_hour == alarm_time.tm_hour)) { + /* Set alarm int info, call real rtc int routine */ + call_rtc_interrupt = 1; + rtc_int_flag |= RTC_AF; + } + } + if (call_rtc_interrupt) { + rtc_int_flag |= (RTC_IRQF | (RTC_NUM_INTS << 8)); + rtc_interrupt(rtc_int_flag, dev_id); + } + return IRQ_HANDLED; +} +#endif + +static int __init nohpet_setup(char *s) +{ + nohpet = 1; + return 1; +} + +__setup("nohpet", nohpet_setup); + +#define HPET_MASK 0xFFFFFFFF +#define HPET_SHIFT 22 + +/* FSEC = 10^-15 NSEC = 10^-9 */ +#define FSEC_PER_NSEC 1000000 + +static void *hpet_ptr; + +static cycle_t read_hpet(void) +{ + return (cycle_t)readl(hpet_ptr); +} + +static cycle_t __vsyscall_fn vread_hpet(void) +{ + return (cycle_t)readl((void *)fix_to_virt(VSYSCALL_HPET) + 0xf0); +} + +struct clocksource clocksource_hpet = { + .name = "hpet", + .rating = 250, + .read = read_hpet, + .mask = (cycle_t)HPET_MASK, + .mult = 0, /* set below */ + .shift = HPET_SHIFT, + .is_continuous = 1, + .vread = vread_hpet, +}; + +static int __init init_hpet_clocksource(void) +{ + unsigned long hpet_period; + void __iomem *hpet_base; + u64 tmp; + + if (!hpet_address) + return -ENODEV; + + /* calculate the hpet address: */ + hpet_base = + (void __iomem*)ioremap_nocache(hpet_address, HPET_MMAP_SIZE); + hpet_ptr = hpet_base + HPET_COUNTER; + + /* calculate the frequency: */ + hpet_period = readl(hpet_base + HPET_PERIOD); + + /* + * hpet period is in femto seconds per cycle + * so we need to convert this to ns/cyc units + * aproximated by mult/2^shift + * + * fsec/cyc * 1nsec/1000000fsec = nsec/cyc = mult/2^shift + * fsec/cyc * 1ns/1000000fsec * 2^shift = mult + * fsec/cyc * 2^shift * 1nsec/1000000fsec = mult + * (fsec/cyc << shift)/1000000 = mult + * (hpet_period << shift)/FSEC_PER_NSEC = mult + */ + tmp = (u64)hpet_period << HPET_SHIFT; + do_div(tmp, FSEC_PER_NSEC); + clocksource_hpet.mult = (u32)tmp; + + return clocksource_register(&clocksource_hpet); +} + +module_init(init_hpet_clocksource); Index: linux/arch/x86_64/kernel/i8259.c =================================================================== --- linux.orig/arch/x86_64/kernel/i8259.c +++ linux/arch/x86_64/kernel/i8259.c @@ -96,12 +96,13 @@ void (*interrupt[NR_IRQS])(void) = { */ static int i8259A_auto_eoi; -DEFINE_SPINLOCK(i8259A_lock); static void mask_and_ack_8259A(unsigned int); +DEFINE_RAW_SPINLOCK(i8259A_lock); static struct irq_chip i8259A_chip = { .name = "XT-PIC", .mask = disable_8259A_irq, + .disable = disable_8259A_irq, .unmask = enable_8259A_irq, .mask_ack = mask_and_ack_8259A, }; @@ -394,7 +395,8 @@ device_initcall(i8259A_init_sysfs); * IRQ2 is cascade interrupt to second interrupt controller */ -static struct irqaction irq2 = { no_action, 0, CPU_MASK_NONE, "cascade", NULL, NULL}; +static struct irqaction irq2 = { no_action, IRQF_NODELAY, CPU_MASK_NONE, "cascade", NULL, NULL}; + DEFINE_PER_CPU(vector_irq_t, vector_irq) = { [0 ... FIRST_EXTERNAL_VECTOR - 1] = -1, [FIRST_EXTERNAL_VECTOR + 0] = 0, Index: linux/arch/x86_64/kernel/io_apic.c =================================================================== --- linux.orig/arch/x86_64/kernel/io_apic.c +++ linux/arch/x86_64/kernel/io_apic.c @@ -62,8 +62,8 @@ int timer_over_8254 __initdata = 1; /* Where if anywhere is the i8259 connect in external int mode */ static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; -static DEFINE_SPINLOCK(ioapic_lock); -DEFINE_SPINLOCK(vector_lock); +static DEFINE_RAW_SPINLOCK(ioapic_lock); +DEFINE_RAW_SPINLOCK(vector_lock); /* * # of IRQ routing registers @@ -149,6 +149,9 @@ static inline void io_apic_sync(unsigned reg = io_apic_read(entry->apic, 0x10 + R + pin*2); \ reg ACTION; \ io_apic_modify(entry->apic, reg); \ + /* Force POST flush by reading: */ \ + reg = io_apic_read(entry->apic, 0x10 + R + pin*2); \ + \ if (!entry->next) \ break; \ entry = irq_2_pin + entry->next; \ @@ -290,10 +293,8 @@ static void add_pin_to_irq(unsigned int static void name##_IO_APIC_irq (unsigned int irq) \ __DO_ACTION(R, ACTION, FINAL) -DO_ACTION( __mask, 0, |= 0x00010000, io_apic_sync(entry->apic) ) - /* mask = 1 */ -DO_ACTION( __unmask, 0, &= 0xfffeffff, ) - /* mask = 0 */ +DO_ACTION( __mask, 0, |= 0x00010000, ) /* mask = 1 */ +DO_ACTION( __unmask, 0, &= 0xfffeffff, ) /* mask = 0 */ static void mask_IO_APIC_irq (unsigned int irq) { @@ -789,7 +790,6 @@ static void ioapic_register_intr(int irq set_irq_chip_and_handler_name(irq, &ioapic_chip, handle_fasteoi_irq, "fasteoi"); else { - irq_desc[irq].status |= IRQ_DELAYED_DISABLE; set_irq_chip_and_handler_name(irq, &ioapic_chip, handle_edge_irq, "edge"); } Index: linux/arch/x86_64/kernel/irq.c =================================================================== --- linux.orig/arch/x86_64/kernel/irq.c +++ linux/arch/x86_64/kernel/irq.c @@ -110,10 +110,18 @@ asmlinkage unsigned int do_IRQ(struct pt unsigned vector = ~regs->orig_rax; unsigned irq; + irq_show_regs_callback(smp_processor_id(), regs); + exit_idle(); irq_enter(); irq = __get_cpu_var(vector_irq)[vector]; +#ifdef CONFIG_EVENT_TRACE + if (irq == trace_user_trigger_irq) + user_trace_start(); +#endif + trace_special(regs->rip, irq, 0); + #ifdef CONFIG_DEBUG_STACKOVERFLOW stack_overflow_check(regs); #endif Index: linux/arch/x86_64/kernel/mce.c =================================================================== --- linux.orig/arch/x86_64/kernel/mce.c +++ linux/arch/x86_64/kernel/mce.c @@ -641,7 +641,6 @@ static __cpuinit int mce_create_device(u return err; } -#ifdef CONFIG_HOTPLUG_CPU static void mce_remove_device(unsigned int cpu) { int i; @@ -674,7 +673,6 @@ mce_cpu_callback(struct notifier_block * static struct notifier_block mce_cpu_notifier = { .notifier_call = mce_cpu_callback, }; -#endif static __init int mce_init_device(void) { Index: linux/arch/x86_64/kernel/mce_amd.c =================================================================== --- linux.orig/arch/x86_64/kernel/mce_amd.c +++ linux/arch/x86_64/kernel/mce_amd.c @@ -551,7 +551,6 @@ out: return err; } -#ifdef CONFIG_HOTPLUG_CPU /* * let's be hotplug friendly. * in case of multiple core processors, the first core always takes ownership @@ -594,12 +593,14 @@ static void threshold_remove_bank(unsign sprintf(name, "threshold_bank%i", bank); +#ifdef CONFIG_SMP /* sibling symlink */ if (shared_bank[bank] && b->blocks->cpu != cpu) { sysfs_remove_link(&per_cpu(device_mce, cpu).kobj, name); per_cpu(threshold_banks, cpu)[bank] = NULL; return; } +#endif /* remove all sibling symlinks before unregistering */ for_each_cpu_mask(i, b->cpus) { @@ -656,7 +657,6 @@ static int threshold_cpu_callback(struct static struct notifier_block threshold_cpu_notifier = { .notifier_call = threshold_cpu_callback, }; -#endif /* CONFIG_HOTPLUG_CPU */ static __init int threshold_init_device(void) { Index: linux/arch/x86_64/kernel/mpparse.c =================================================================== --- linux.orig/arch/x86_64/kernel/mpparse.c +++ linux/arch/x86_64/kernel/mpparse.c @@ -302,7 +302,7 @@ static int __init smp_read_mpc(struct mp } } } - clustered_apic_check(); + setup_apic_routing(); if (!num_processors) printk(KERN_ERR "MPTABLE: no processors registered!\n"); return num_processors; Index: linux/arch/x86_64/kernel/nmi.c =================================================================== --- linux.orig/arch/x86_64/kernel/nmi.c +++ linux/arch/x86_64/kernel/nmi.c @@ -198,7 +198,9 @@ void nmi_watchdog_default(void) static __init void nmi_cpu_busy(void *data) { volatile int *endflag = data; +#ifndef CONFIG_PREEMPT_RT local_irq_enable_in_hardirq(); +#endif /* Intentionally don't use cpu_relax here. This is to make sure that the performance counter really ticks, even if there is a simulator or similar that catches the @@ -263,7 +265,7 @@ int __init check_nmi_watchdog (void) if (nmi_watchdog == NMI_LOCAL_APIC) { struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk); - nmi_hz = 1; + nmi_hz = 10000; /* * On Intel CPUs with ARCH_PERFMON only 32 bits in the counter * are writable, with higher bits sign extending from bit 31. @@ -358,6 +360,33 @@ void enable_timer_nmi_watchdog(void) } } +static void __acpi_nmi_disable(void *__unused) +{ + apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED); +} + +/* + * Disable timer based NMIs on all CPUs: + */ +void acpi_nmi_disable(void) +{ + if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) + on_each_cpu(__acpi_nmi_disable, NULL, 0, 1); +} + +static void __acpi_nmi_enable(void *__unused) +{ + apic_write(APIC_LVT0, APIC_DM_NMI); +} + +/* + * Enable timer based NMIs on all CPUs: + */ +void acpi_nmi_enable(void) +{ + if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) + on_each_cpu(__acpi_nmi_enable, NULL, 0, 1); +} #ifdef CONFIG_PM static int nmi_pm_active; /* nmi_active before suspend */ @@ -778,14 +807,53 @@ void touch_nmi_watchdog (void) touch_softlockup_watchdog(); } +int nmi_show_regs[NR_CPUS]; + +void nmi_show_all_regs(void) +{ + int i; + + if (nmi_watchdog == NMI_NONE) + return; + if (system_state == SYSTEM_BOOTING) { + printk("nmi_show_all_regs(): system state %d, not doing.\n", + system_state); + return; + } + + for_each_online_cpu(i) + nmi_show_regs[i] = 1; + for_each_online_cpu(i) + while (nmi_show_regs[i] == 1) + barrier(); +} + +static DEFINE_RAW_SPINLOCK(nmi_print_lock); + +notrace void irq_show_regs_callback(int cpu, struct pt_regs *regs) +{ + if (!nmi_show_regs[cpu]) + return; + + nmi_show_regs[cpu] = 0; + spin_lock(&nmi_print_lock); + printk("NMI show regs on CPU#%d:\n", cpu); + show_regs(regs); + spin_unlock(&nmi_print_lock); +} + int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { int sum; int touched = 0; - struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk); + int cpu = smp_processor_id(); + struct nmi_watchdog_ctlblk *wd = &per_cpu(nmi_watchdog_ctlblk, cpu); u64 dummy; int rc=0; + irq_show_regs_callback(cpu, regs); + __profile_tick(CPU_PROFILING, regs); + /* check for other users first */ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP) { @@ -794,6 +862,7 @@ int __kprobes nmi_watchdog_tick(struct p } sum = read_pda(apic_timer_irqs); + if (__get_cpu_var(nmi_touch)) { __get_cpu_var(nmi_touch) = 0; touched = 1; @@ -812,9 +881,20 @@ int __kprobes nmi_watchdog_tick(struct p * wait a few IRQs (5 seconds) before doing the oops ... */ local_inc(&__get_cpu_var(alert_counter)); - if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz) + if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz) { + int i; + + for_each_online_cpu(i) { + if (i == cpu) + continue; + nmi_show_regs[i] = 1; + while (nmi_show_regs[i] == 1) + cpu_relax(); + } + die_nmi("NMI Watchdog detected LOCKUP on CPU %d\n", regs, panic_on_timeout); + } } else { __get_cpu_var(last_irq_sum) = sum; local_set(&__get_cpu_var(alert_counter), 0); Index: linux/arch/x86_64/kernel/pmtimer.c =================================================================== --- linux.orig/arch/x86_64/kernel/pmtimer.c +++ linux/arch/x86_64/kernel/pmtimer.c @@ -24,15 +24,6 @@ #include #include -/* The I/O port the PMTMR resides at. - * The location is detected during setup_arch(), - * in arch/i386/kernel/acpi/boot.c */ -u32 pmtmr_ioport __read_mostly; - -/* value of the Power timer at last timer interrupt */ -static u32 offset_delay; -static u32 last_pmtmr_tick; - #define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */ static inline u32 cyc2us(u32 cycles) @@ -48,38 +39,6 @@ static inline u32 cyc2us(u32 cycles) return (cycles >> 10); } -int pmtimer_mark_offset(void) -{ - static int first_run = 1; - unsigned long tsc; - u32 lost; - - u32 tick = inl(pmtmr_ioport); - u32 delta; - - delta = cyc2us((tick - last_pmtmr_tick) & ACPI_PM_MASK); - - last_pmtmr_tick = tick; - monotonic_base += delta * NSEC_PER_USEC; - - delta += offset_delay; - - lost = delta / (USEC_PER_SEC / HZ); - offset_delay = delta % (USEC_PER_SEC / HZ); - - rdtscll(tsc); - vxtime.last_tsc = tsc - offset_delay * (u64)cpu_khz / 1000; - - /* don't calculate delay for first run, - or if we've got less then a tick */ - if (first_run || (lost < 1)) { - first_run = 0; - offset_delay = 0; - } - - return lost - 1; -} - static unsigned pmtimer_wait_tick(void) { u32 a, b; @@ -101,23 +60,6 @@ void pmtimer_wait(unsigned us) } while (cyc2us(b - a) < us); } -void pmtimer_resume(void) -{ - last_pmtmr_tick = inl(pmtmr_ioport); -} - -unsigned int do_gettimeoffset_pm(void) -{ - u32 now, offset, delta = 0; - - offset = last_pmtmr_tick; - now = inl(pmtmr_ioport); - delta = (now - offset) & ACPI_PM_MASK; - - return offset_delay + cyc2us(delta); -} - - static int __init nopmtimer_setup(char *s) { pmtmr_ioport = 0; Index: linux/arch/x86_64/kernel/process.c =================================================================== --- linux.orig/arch/x86_64/kernel/process.c +++ linux/arch/x86_64/kernel/process.c @@ -112,9 +112,9 @@ static void default_idle(void) current_thread_info()->status &= ~TS_POLLING; smp_mb__after_clear_bit(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { local_irq_disable(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) safe_halt(); else local_irq_enable(); @@ -137,7 +137,7 @@ static void poll_idle (void) "rep; nop;" "je 2b;" : : - "i" (_TIF_NEED_RESCHED), + "i" (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED), "m" (current_thread_info()->flags)); } @@ -207,7 +207,8 @@ void cpu_idle (void) current_thread_info()->status |= TS_POLLING; /* endless idle loop with no priority at all */ while (1) { - while (!need_resched()) { + hrtimer_stop_sched_tick(); + while (!need_resched() && !need_resched_delayed()) { void (*idle)(void); if (__get_cpu_var(cpu_idle_state)) @@ -227,9 +228,12 @@ void cpu_idle (void) __exit_idle(); } - preempt_enable_no_resched(); - schedule(); + hrtimer_restart_sched_tick(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } @@ -245,10 +249,10 @@ void cpu_idle (void) */ void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) __mwait(eax, ecx); } } @@ -257,7 +261,7 @@ void mwait_idle_with_hints(unsigned long static void mwait_idle(void) { local_irq_enable(); - while (!need_resched()) + while (!need_resched() && !need_resched_delayed()) mwait_idle_with_hints(0,0); } @@ -358,7 +362,7 @@ void exit_thread(void) struct thread_struct *t = &me->thread; if (me->thread.io_bitmap_ptr) { - struct tss_struct *tss = &per_cpu(init_tss, get_cpu()); + struct tss_struct *tss; kfree(t->io_bitmap_ptr); t->io_bitmap_ptr = NULL; @@ -366,6 +370,7 @@ void exit_thread(void) /* * Careful, clear this in the TSS too: */ + tss = &per_cpu(init_tss, get_cpu()); memset(tss->io_bitmap, 0xff, t->io_bitmap_max); t->io_bitmap_max = 0; put_cpu(); Index: linux/arch/x86_64/kernel/setup64.c =================================================================== --- linux.orig/arch/x86_64/kernel/setup64.c +++ linux/arch/x86_64/kernel/setup64.c @@ -115,7 +115,7 @@ void __init setup_per_cpu_areas(void) } } -void pda_init(int cpu) +void notrace pda_init(int cpu) { struct x8664_pda *pda = cpu_pda(cpu); @@ -189,7 +189,7 @@ unsigned long kernel_eflags; * 'CPU state barrier', nothing should get across. * A lot of state is already set up in PDA init. */ -void __cpuinit cpu_init (void) +void __cpuinit notrace cpu_init (void) { int cpu = stack_smp_processor_id(); struct tss_struct *t = &per_cpu(init_tss, cpu); Index: linux/arch/x86_64/kernel/signal.c =================================================================== --- linux.orig/arch/x86_64/kernel/signal.c +++ linux/arch/x86_64/kernel/signal.c @@ -396,6 +396,13 @@ static void do_signal(struct pt_regs *re int signr; sigset_t *oldset; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux/arch/x86_64/kernel/smp.c =================================================================== --- linux.orig/arch/x86_64/kernel/smp.c +++ linux/arch/x86_64/kernel/smp.c @@ -57,7 +57,7 @@ union smp_flush_state { struct mm_struct *flush_mm; unsigned long flush_va; #define FLUSH_ALL -1ULL - spinlock_t tlbstate_lock; + raw_spinlock_t tlbstate_lock; }; char pad[SMP_CACHE_BYTES]; } ____cacheline_aligned; @@ -296,10 +296,20 @@ void smp_send_reschedule(int cpu) } /* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + send_IPI_allbutself(RESCHEDULE_VECTOR); +} + +/* * Structure and data for smp_call_function(). This is designed to minimise * static memory requirements. It also looks cleaner. */ -static DEFINE_SPINLOCK(call_lock); +static DEFINE_RAW_SPINLOCK(call_lock); struct call_data_struct { void (*func) (void *info); Index: linux/arch/x86_64/kernel/smpboot.c =================================================================== --- linux.orig/arch/x86_64/kernel/smpboot.c +++ linux/arch/x86_64/kernel/smpboot.c @@ -147,217 +147,6 @@ static void __cpuinit smp_store_cpu_info print_cpu_info(c); } -/* - * New Funky TSC sync algorithm borrowed from IA64. - * Main advantage is that it doesn't reset the TSCs fully and - * in general looks more robust and it works better than my earlier - * attempts. I believe it was written by David Mosberger. Some minor - * adjustments for x86-64 by me -AK - * - * Original comment reproduced below. - * - * Synchronize TSC of the current (slave) CPU with the TSC of the - * MASTER CPU (normally the time-keeper CPU). We use a closed loop to - * eliminate the possibility of unaccounted-for errors (such as - * getting a machine check in the middle of a calibration step). The - * basic idea is for the slave to ask the master what itc value it has - * and to read its own itc before and after the master responds. Each - * iteration gives us three timestamps: - * - * slave master - * - * t0 ---\ - * ---\ - * ---> - * tm - * /--- - * /--- - * t1 <--- - * - * - * The goal is to adjust the slave's TSC such that tm falls exactly - * half-way between t0 and t1. If we achieve this, the clocks are - * synchronized provided the interconnect between the slave and the - * master is symmetric. Even if the interconnect were asymmetric, we - * would still know that the synchronization error is smaller than the - * roundtrip latency (t0 - t1). - * - * When the interconnect is quiet and symmetric, this lets us - * synchronize the TSC to within one or two cycles. However, we can - * only *guarantee* that the synchronization is accurate to within a - * round-trip time, which is typically in the range of several hundred - * cycles (e.g., ~500 cycles). In practice, this means that the TSCs - * are usually almost perfectly synchronized, but we shouldn't assume - * that the accuracy is much better than half a micro second or so. - * - * [there are other errors like the latency of RDTSC and of the - * WRMSR. These can also account to hundreds of cycles. So it's - * probably worse. It claims 153 cycles error on a dual Opteron, - * but I suspect the numbers are actually somewhat worse -AK] - */ - -#define MASTER 0 -#define SLAVE (SMP_CACHE_BYTES/8) - -/* Intentionally don't use cpu_relax() while TSC synchronization - because we don't want to go into funky power save modi or cause - hypervisors to schedule us away. Going to sleep would likely affect - latency and low latency is the primary objective here. -AK */ -#define no_cpu_relax() barrier() - -static __cpuinitdata DEFINE_SPINLOCK(tsc_sync_lock); -static volatile __cpuinitdata unsigned long go[SLAVE + 1]; -static int notscsync __cpuinitdata; - -#undef DEBUG_TSC_SYNC - -#define NUM_ROUNDS 64 /* magic value */ -#define NUM_ITERS 5 /* likewise */ - -/* Callback on boot CPU */ -static __cpuinit void sync_master(void *arg) -{ - unsigned long flags, i; - - go[MASTER] = 0; - - local_irq_save(flags); - { - for (i = 0; i < NUM_ROUNDS*NUM_ITERS; ++i) { - while (!go[MASTER]) - no_cpu_relax(); - go[MASTER] = 0; - rdtscll(go[SLAVE]); - } - } - local_irq_restore(flags); -} - -/* - * Return the number of cycles by which our tsc differs from the tsc - * on the master (time-keeper) CPU. A positive number indicates our - * tsc is ahead of the master, negative that it is behind. - */ -static inline long -get_delta(long *rt, long *master) -{ - unsigned long best_t0 = 0, best_t1 = ~0UL, best_tm = 0; - unsigned long tcenter, t0, t1, tm; - int i; - - for (i = 0; i < NUM_ITERS; ++i) { - rdtscll(t0); - go[MASTER] = 1; - while (!(tm = go[SLAVE])) - no_cpu_relax(); - go[SLAVE] = 0; - rdtscll(t1); - - if (t1 - t0 < best_t1 - best_t0) - best_t0 = t0, best_t1 = t1, best_tm = tm; - } - - *rt = best_t1 - best_t0; - *master = best_tm - best_t0; - - /* average best_t0 and best_t1 without overflow: */ - tcenter = (best_t0/2 + best_t1/2); - if (best_t0 % 2 + best_t1 % 2 == 2) - ++tcenter; - return tcenter - best_tm; -} - -static __cpuinit void sync_tsc(unsigned int master) -{ - int i, done = 0; - long delta, adj, adjust_latency = 0; - unsigned long flags, rt, master_time_stamp, bound; -#ifdef DEBUG_TSC_SYNC - static struct syncdebug { - long rt; /* roundtrip time */ - long master; /* master's timestamp */ - long diff; /* difference between midpoint and master's timestamp */ - long lat; /* estimate of tsc adjustment latency */ - } t[NUM_ROUNDS] __cpuinitdata; -#endif - - printk(KERN_INFO "CPU %d: Syncing TSC to CPU %u.\n", - smp_processor_id(), master); - - go[MASTER] = 1; - - /* It is dangerous to broadcast IPI as cpus are coming up, - * as they may not be ready to accept them. So since - * we only need to send the ipi to the boot cpu direct - * the message, and avoid the race. - */ - smp_call_function_single(master, sync_master, NULL, 1, 0); - - while (go[MASTER]) /* wait for master to be ready */ - no_cpu_relax(); - - spin_lock_irqsave(&tsc_sync_lock, flags); - { - for (i = 0; i < NUM_ROUNDS; ++i) { - delta = get_delta(&rt, &master_time_stamp); - if (delta == 0) { - done = 1; /* let's lock on to this... */ - bound = rt; - } - - if (!done) { - unsigned long t; - if (i > 0) { - adjust_latency += -delta; - adj = -delta + adjust_latency/4; - } else - adj = -delta; - - rdtscll(t); - wrmsrl(MSR_IA32_TSC, t + adj); - } -#ifdef DEBUG_TSC_SYNC - t[i].rt = rt; - t[i].master = master_time_stamp; - t[i].diff = delta; - t[i].lat = adjust_latency/4; -#endif - } - } - spin_unlock_irqrestore(&tsc_sync_lock, flags); - -#ifdef DEBUG_TSC_SYNC - for (i = 0; i < NUM_ROUNDS; ++i) - printk("rt=%5ld master=%5ld diff=%5ld adjlat=%5ld\n", - t[i].rt, t[i].master, t[i].diff, t[i].lat); -#endif - - printk(KERN_INFO - "CPU %d: synchronized TSC with CPU %u (last diff %ld cycles, " - "maxerr %lu cycles)\n", - smp_processor_id(), master, delta, rt); -} - -static void __cpuinit tsc_sync_wait(void) -{ - /* - * When the CPU has synchronized TSCs assume the BIOS - * or the hardware already synced. Otherwise we could - * mess up a possible perfect synchronization with a - * not-quite-perfect algorithm. - */ - if (notscsync || !cpu_has_tsc || !unsynchronized_tsc()) - return; - sync_tsc(0); -} - -static __init int notscsync_setup(char *s) -{ - notscsync = 1; - return 1; -} -__setup("notscsync", notscsync_setup); - static atomic_t init_deasserted __cpuinitdata; /* @@ -531,7 +320,7 @@ static inline void set_cpu_sibling_map(i /* * Setup code on secondary processor (after comming out of the trampoline) */ -void __cpuinit start_secondary(void) +void __cpuinit notrace start_secondary(void) { /* * Dont put anything before smp_callin(), SMP @@ -545,6 +334,11 @@ void __cpuinit start_secondary(void) /* otherwise gcc will move up the smp_processor_id before the cpu_init */ barrier(); + /* + * Check TSC sync first: + */ + check_tsc_sync_target(); + Dprintk("cpu %d: setting up apic clock\n", smp_processor_id()); setup_secondary_APIC_clock(); @@ -564,14 +358,6 @@ void __cpuinit start_secondary(void) */ set_cpu_sibling_map(smp_processor_id()); - /* - * Wait for TSC sync to not schedule things before. - * We still process interrupts, which could see an inconsistent - * time in that window unfortunately. - * Do this here because TSC sync has global unprotected state. - */ - tsc_sync_wait(); - /* * We need to hold call_lock, so there is no inconsistency * between the time smp_call_function() determines number of @@ -591,6 +377,7 @@ void __cpuinit start_secondary(void) cpu_set(smp_processor_id(), cpu_online_map); per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; spin_unlock(&vector_lock); + unlock_ipi_call_lock(); cpu_idle(); @@ -1165,6 +952,11 @@ int __cpuinit __cpu_up(unsigned int cpu) /* Unleash the CPU! */ Dprintk("waiting for cpu %d\n", cpu); + /* + * Make sure and check TSC sync: + */ + check_tsc_sync_source(cpu); + while (!cpu_isset(cpu, cpu_online_map)) cpu_relax(); err = 0; @@ -1180,7 +972,6 @@ void __init smp_cpus_done(unsigned int m smp_cleanup_boot(); setup_ioapic_dest(); check_nmi_watchdog(); - time_init_gtod(); } #ifdef CONFIG_HOTPLUG_CPU Index: linux/arch/x86_64/kernel/time.c =================================================================== --- linux.orig/arch/x86_64/kernel/time.c +++ linux/arch/x86_64/kernel/time.c @@ -41,142 +41,28 @@ #include #include #include +#include +#include #include +#include -#ifdef CONFIG_CPU_FREQ -static void cpufreq_delayed_get(void); -#endif extern void i8254_timer_resume(void); extern int using_apic_timer; +extern struct clock_event_device pit_clockevent; static char *timename = NULL; DEFINE_SPINLOCK(rtc_lock); EXPORT_SYMBOL(rtc_lock); -DEFINE_SPINLOCK(i8253_lock); - -int nohpet __initdata = 0; -static int notsc __initdata = 0; +DEFINE_RAW_SPINLOCK(i8253_lock); #define USEC_PER_TICK (USEC_PER_SEC / HZ) #define NSEC_PER_TICK (NSEC_PER_SEC / HZ) -#define FSEC_PER_TICK (FSEC_PER_SEC / HZ) -#define NS_SCALE 10 /* 2^10, carefully chosen */ -#define US_SCALE 32 /* 2^32, arbitralrily chosen */ -unsigned int cpu_khz; /* TSC clocks / usec, not used here */ -EXPORT_SYMBOL(cpu_khz); -static unsigned long hpet_period; /* fsecs / HPET clock */ -unsigned long hpet_tick; /* HPET clocks / interrupt */ -int hpet_use_timer; /* Use counter of hpet for time keeping, otherwise PIT */ -unsigned long vxtime_hz = PIT_TICK_RATE; int report_lost_ticks; /* command line option */ -unsigned long long monotonic_base; - -struct vxtime_data __vxtime __section_vxtime; /* for vsyscalls */ - -volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES; -struct timespec __xtime __section_xtime; -struct timezone __sys_tz __section_sys_tz; - -/* - * do_gettimeoffset() returns microseconds since last timer interrupt was - * triggered by hardware. A memory read of HPET is slower than a register read - * of TSC, but much more reliable. It's also synchronized to the timer - * interrupt. Note that do_gettimeoffset() may return more than hpet_tick, if a - * timer interrupt has happened already, but vxtime.trigger wasn't updated yet. - * This is not a problem, because jiffies hasn't updated either. They are bound - * together by xtime_lock. - */ - -static inline unsigned int do_gettimeoffset_tsc(void) -{ - unsigned long t; - unsigned long x; - t = get_cycles_sync(); - if (t < vxtime.last_tsc) - t = vxtime.last_tsc; /* hack */ - x = ((t - vxtime.last_tsc) * vxtime.tsc_quot) >> US_SCALE; - return x; -} - -static inline unsigned int do_gettimeoffset_hpet(void) -{ - /* cap counter read to one tick to avoid inconsistencies */ - unsigned long counter = hpet_readl(HPET_COUNTER) - vxtime.last; - return (min(counter,hpet_tick) * vxtime.quot) >> US_SCALE; -} - -unsigned int (*do_gettimeoffset)(void) = do_gettimeoffset_tsc; - -/* - * This version of gettimeofday() has microsecond resolution and better than - * microsecond precision, as we're using at least a 10 MHz (usually 14.31818 - * MHz) HPET timer. - */ - -void do_gettimeofday(struct timeval *tv) -{ - unsigned long seq; - unsigned int sec, usec; - - do { - seq = read_seqbegin(&xtime_lock); - - sec = xtime.tv_sec; - usec = xtime.tv_nsec / NSEC_PER_USEC; - - /* i386 does some correction here to keep the clock - monotonous even when ntpd is fixing drift. - But they didn't work for me, there is a non monotonic - clock anyways with ntp. - I dropped all corrections now until a real solution can - be found. Note when you fix it here you need to do the same - in arch/x86_64/kernel/vsyscall.c and export all needed - variables in vmlinux.lds. -AK */ - usec += do_gettimeoffset(); - - } while (read_seqretry(&xtime_lock, seq)); - tv->tv_sec = sec + usec / USEC_PER_SEC; - tv->tv_usec = usec % USEC_PER_SEC; -} - -EXPORT_SYMBOL(do_gettimeofday); - -/* - * settimeofday() first undoes the correction that gettimeofday would do - * on the time, and then saves it. This is ugly, but has been like this for - * ages already. - */ - -int do_settimeofday(struct timespec *tv) -{ - time_t wtm_sec, sec = tv->tv_sec; - long wtm_nsec, nsec = tv->tv_nsec; - - if ((unsigned long)tv->tv_nsec >= NSEC_PER_SEC) - return -EINVAL; - - write_seqlock_irq(&xtime_lock); - - nsec -= do_gettimeoffset() * NSEC_PER_USEC; - - wtm_sec = wall_to_monotonic.tv_sec + (xtime.tv_sec - sec); - wtm_nsec = wall_to_monotonic.tv_nsec + (xtime.tv_nsec - nsec); - - set_normalized_timespec(&xtime, sec, nsec); - set_normalized_timespec(&wall_to_monotonic, wtm_sec, wtm_nsec); - - ntp_clear(); - - write_sequnlock_irq(&xtime_lock); - clock_was_set(); - return 0; -} - -EXPORT_SYMBOL(do_settimeofday); +volatile unsigned long jiffies = INITIAL_JIFFIES; unsigned long profile_pc(struct pt_regs *regs) { @@ -196,303 +82,19 @@ unsigned long profile_pc(struct pt_regs } EXPORT_SYMBOL(profile_pc); -/* - * In order to set the CMOS clock precisely, set_rtc_mmss has to be called 500 - * ms after the second nowtime has started, because when nowtime is written - * into the registers of the CMOS clock, it will jump to the next second - * precisely 500 ms later. Check the Motorola MC146818A or Dallas DS12887 data - * sheet for details. - */ - -static void set_rtc_mmss(unsigned long nowtime) -{ - int real_seconds, real_minutes, cmos_minutes; - unsigned char control, freq_select; - -/* - * IRQs are disabled when we're called from the timer interrupt, - * no need for spin_lock_irqsave() - */ - - spin_lock(&rtc_lock); - -/* - * Tell the clock it's being set and stop it. - */ - - control = CMOS_READ(RTC_CONTROL); - CMOS_WRITE(control | RTC_SET, RTC_CONTROL); - - freq_select = CMOS_READ(RTC_FREQ_SELECT); - CMOS_WRITE(freq_select | RTC_DIV_RESET2, RTC_FREQ_SELECT); - - cmos_minutes = CMOS_READ(RTC_MINUTES); - BCD_TO_BIN(cmos_minutes); - -/* - * since we're only adjusting minutes and seconds, don't interfere with hour - * overflow. This avoids messing with unknown time zones but requires your RTC - * not to be off by more than 15 minutes. Since we're calling it only when - * our clock is externally synchronized using NTP, this shouldn't be a problem. - */ - - real_seconds = nowtime % 60; - real_minutes = nowtime / 60; - if (((abs(real_minutes - cmos_minutes) + 15) / 30) & 1) - real_minutes += 30; /* correct for half hour time zone */ - real_minutes %= 60; - - if (abs(real_minutes - cmos_minutes) >= 30) { - printk(KERN_WARNING "time.c: can't update CMOS clock " - "from %d to %d\n", cmos_minutes, real_minutes); - } else { - BIN_TO_BCD(real_seconds); - BIN_TO_BCD(real_minutes); - CMOS_WRITE(real_seconds, RTC_SECONDS); - CMOS_WRITE(real_minutes, RTC_MINUTES); - } - -/* - * The following flags have to be released exactly in this order, otherwise the - * DS12887 (popular MC146818A clone with integrated battery and quartz) will - * not reset the oscillator and will not update precisely 500 ms later. You - * won't find this mentioned in the Dallas Semiconductor data sheets, but who - * believes data sheets anyway ... -- Markus Kuhn - */ - - CMOS_WRITE(control, RTC_CONTROL); - CMOS_WRITE(freq_select, RTC_FREQ_SELECT); - - spin_unlock(&rtc_lock); -} - - -/* monotonic_clock(): returns # of nanoseconds passed since time_init() - * Note: This function is required to return accurate - * time even in the absence of multiple timer ticks. - */ -static inline unsigned long long cycles_2_ns(unsigned long long cyc); -unsigned long long monotonic_clock(void) -{ - unsigned long seq; - u32 last_offset, this_offset, offset; - unsigned long long base; - - if (vxtime.mode == VXTIME_HPET) { - do { - seq = read_seqbegin(&xtime_lock); - - last_offset = vxtime.last; - base = monotonic_base; - this_offset = hpet_readl(HPET_COUNTER); - } while (read_seqretry(&xtime_lock, seq)); - offset = (this_offset - last_offset); - offset *= NSEC_PER_TICK / hpet_tick; - } else { - do { - seq = read_seqbegin(&xtime_lock); - - last_offset = vxtime.last_tsc; - base = monotonic_base; - } while (read_seqretry(&xtime_lock, seq)); - this_offset = get_cycles_sync(); - offset = cycles_2_ns(this_offset - last_offset); - } - return base + offset; -} -EXPORT_SYMBOL(monotonic_clock); - -static noinline void handle_lost_ticks(int lost) -{ - static long lost_count; - static int warned; - if (report_lost_ticks) { - printk(KERN_WARNING "time.c: Lost %d timer tick(s)! ", lost); - print_symbol("rip %s)\n", get_irq_regs()->rip); - } - - if (lost_count == 1000 && !warned) { - printk(KERN_WARNING "warning: many lost ticks.\n" - KERN_WARNING "Your time source seems to be instable or " - "some driver is hogging interupts\n"); - print_symbol("rip %s\n", get_irq_regs()->rip); - if (vxtime.mode == VXTIME_TSC && vxtime.hpet_address) { - printk(KERN_WARNING "Falling back to HPET\n"); - if (hpet_use_timer) - vxtime.last = hpet_readl(HPET_T0_CMP) - - hpet_tick; - else - vxtime.last = hpet_readl(HPET_COUNTER); - vxtime.mode = VXTIME_HPET; - do_gettimeoffset = do_gettimeoffset_hpet; - } - /* else should fall back to PIT, but code missing. */ - warned = 1; - } else - lost_count++; - -#ifdef CONFIG_CPU_FREQ - /* In some cases the CPU can change frequency without us noticing - Give cpufreq a change to catch up. */ - if ((lost_count+1) % 25 == 0) - cpufreq_delayed_get(); -#endif -} - void main_timer_handler(void) { - static unsigned long rtc_update = 0; - unsigned long tsc; - int delay = 0, offset = 0, lost = 0; - -/* - * Here we are in the timer irq handler. We have irqs locally disabled (so we - * don't need spin_lock_irqsave()) but we don't know if the timer_bh is running - * on the other CPU, so we need a lock. We also need to lock the vsyscall - * variables, because both do_timer() and us change them -arca+vojtech - */ - - write_seqlock(&xtime_lock); - - if (vxtime.hpet_address) - offset = hpet_readl(HPET_COUNTER); - - if (hpet_use_timer) { - /* if we're using the hpet timer functionality, - * we can more accurately know the counter value - * when the timer interrupt occured. - */ - offset = hpet_readl(HPET_T0_CMP) - hpet_tick; - delay = hpet_readl(HPET_COUNTER) - offset; - } else if (!pmtmr_ioport) { - spin_lock(&i8253_lock); - outb_p(0x00, 0x43); - delay = inb_p(0x40); - delay |= inb(0x40) << 8; - spin_unlock(&i8253_lock); - delay = LATCH - 1 - delay; - } - - tsc = get_cycles_sync(); - - if (vxtime.mode == VXTIME_HPET) { - if (offset - vxtime.last > hpet_tick) { - lost = (offset - vxtime.last) / hpet_tick - 1; - } - - monotonic_base += - (offset - vxtime.last) * NSEC_PER_TICK / hpet_tick; - - vxtime.last = offset; -#ifdef CONFIG_X86_PM_TIMER - } else if (vxtime.mode == VXTIME_PMTMR) { - lost = pmtimer_mark_offset(); -#endif - } else { - offset = (((tsc - vxtime.last_tsc) * - vxtime.tsc_quot) >> US_SCALE) - USEC_PER_TICK; - - if (offset < 0) - offset = 0; - - if (offset > USEC_PER_TICK) { - lost = offset / USEC_PER_TICK; - offset %= USEC_PER_TICK; - } - - monotonic_base += cycles_2_ns(tsc - vxtime.last_tsc); - - vxtime.last_tsc = tsc - vxtime.quot * delay / vxtime.tsc_quot; - - if ((((tsc - vxtime.last_tsc) * - vxtime.tsc_quot) >> US_SCALE) < offset) - vxtime.last_tsc = tsc - - (((long) offset << US_SCALE) / vxtime.tsc_quot) - 1; - } - - if (lost > 0) - handle_lost_ticks(lost); - else - lost = 0; - -/* - * Do the timer stuff. - */ - - do_timer(lost + 1); -#ifndef CONFIG_SMP - update_process_times(user_mode(get_irq_regs())); -#endif - -/* - * In the SMP case we use the local APIC timer interrupt to do the profiling, - * except when we simulate SMP mode on a uniprocessor system, in that case we - * have to call the local interrupt handler. - */ - - if (!using_apic_timer) - smp_local_timer_interrupt(); - -/* - * If we have an externally synchronized Linux clock, then update CMOS clock - * accordingly every ~11 minutes. set_rtc_mmss() will be called in the jiffy - * closest to exactly 500 ms before the next second. If the update fails, we - * don't care, as it'll be updated on the next turn, and the problem (time way - * off) isn't likely to go away much sooner anyway. - */ - - if (ntp_synced() && xtime.tv_sec > rtc_update && - abs(xtime.tv_nsec - 500000000) <= tick_nsec / 2) { - set_rtc_mmss(xtime.tv_sec); - rtc_update = xtime.tv_sec + 660; - } - - write_sequnlock(&xtime_lock); + pit_clockevent.event_handler(get_irq_regs()); } static irqreturn_t timer_interrupt(int irq, void *dev_id) { - if (apic_runs_main_timer > 1) - return IRQ_HANDLED; main_timer_handler(); if (using_apic_timer) smp_send_timer_broadcast_ipi(); return IRQ_HANDLED; } -static unsigned int cyc2ns_scale __read_mostly; - -static inline void set_cyc2ns_scale(unsigned long cpu_khz) -{ - cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / cpu_khz; -} - -static inline unsigned long long cycles_2_ns(unsigned long long cyc) -{ - return (cyc * cyc2ns_scale) >> NS_SCALE; -} - -unsigned long long sched_clock(void) -{ - unsigned long a = 0; - -#if 0 - /* Don't do a HPET read here. Using TSC always is much faster - and HPET may not be mapped yet when the scheduler first runs. - Disadvantage is a small drift between CPUs in some configurations, - but that should be tolerable. */ - if (__vxtime.mode == VXTIME_HPET) - return (hpet_readl(HPET_COUNTER) * vxtime.quot) >> US_SCALE; -#endif - - /* Could do CPU core sync here. Opteron can execute rdtsc speculatively, - which means it is not completely exact and may not be monotonous between - CPUs. But the errors should be too small to matter for scheduling - purposes. */ - - rdtscll(a); - return cycles_2_ns(a); -} static unsigned long get_cmos_time(void) { @@ -545,142 +147,6 @@ static unsigned long get_cmos_time(void) return mktime(year, mon, day, hour, min, sec); } -#ifdef CONFIG_CPU_FREQ - -/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency - changes. - - RED-PEN: On SMP we assume all CPUs run with the same frequency. It's - not that important because current Opteron setups do not support - scaling on SMP anyroads. - - Should fix up last_tsc too. Currently gettimeofday in the - first tick after the change will be slightly wrong. */ - -#include - -static unsigned int cpufreq_delayed_issched = 0; -static unsigned int cpufreq_init = 0; -static struct work_struct cpufreq_delayed_get_work; - -static void handle_cpufreq_delayed_get(void *v) -{ - unsigned int cpu; - for_each_online_cpu(cpu) { - cpufreq_get(cpu); - } - cpufreq_delayed_issched = 0; -} - -/* if we notice lost ticks, schedule a call to cpufreq_get() as it tries - * to verify the CPU frequency the timing core thinks the CPU is running - * at is still correct. - */ -static void cpufreq_delayed_get(void) -{ - static int warned; - if (cpufreq_init && !cpufreq_delayed_issched) { - cpufreq_delayed_issched = 1; - if (!warned) { - warned = 1; - printk(KERN_DEBUG - "Losing some ticks... checking if CPU frequency changed.\n"); - } - schedule_work(&cpufreq_delayed_get_work); - } -} - -static unsigned int ref_freq = 0; -static unsigned long loops_per_jiffy_ref = 0; - -static unsigned long cpu_khz_ref = 0; - -static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val, - void *data) -{ - struct cpufreq_freqs *freq = data; - unsigned long *lpj, dummy; - - if (cpu_has(&cpu_data[freq->cpu], X86_FEATURE_CONSTANT_TSC)) - return 0; - - lpj = &dummy; - if (!(freq->flags & CPUFREQ_CONST_LOOPS)) -#ifdef CONFIG_SMP - lpj = &cpu_data[freq->cpu].loops_per_jiffy; -#else - lpj = &boot_cpu_data.loops_per_jiffy; -#endif - - if (!ref_freq) { - ref_freq = freq->old; - loops_per_jiffy_ref = *lpj; - cpu_khz_ref = cpu_khz; - } - if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) || - (val == CPUFREQ_POSTCHANGE && freq->old > freq->new) || - (val == CPUFREQ_RESUMECHANGE)) { - *lpj = - cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new); - - cpu_khz = cpufreq_scale(cpu_khz_ref, ref_freq, freq->new); - if (!(freq->flags & CPUFREQ_CONST_LOOPS)) - vxtime.tsc_quot = (USEC_PER_MSEC << US_SCALE) / cpu_khz; - } - - set_cyc2ns_scale(cpu_khz_ref); - - return 0; -} - -static struct notifier_block time_cpufreq_notifier_block = { - .notifier_call = time_cpufreq_notifier -}; - -static int __init cpufreq_tsc(void) -{ - INIT_WORK(&cpufreq_delayed_get_work, handle_cpufreq_delayed_get, NULL); - if (!cpufreq_register_notifier(&time_cpufreq_notifier_block, - CPUFREQ_TRANSITION_NOTIFIER)) - cpufreq_init = 1; - return 0; -} - -core_initcall(cpufreq_tsc); - -#endif - -/* - * calibrate_tsc() calibrates the processor TSC in a very simple way, comparing - * it to the HPET timer of known frequency. - */ - -#define TICK_COUNT 100000000 - -static unsigned int __init hpet_calibrate_tsc(void) -{ - int tsc_start, hpet_start; - int tsc_now, hpet_now; - unsigned long flags; - - local_irq_save(flags); - local_irq_disable(); - - hpet_start = hpet_readl(HPET_COUNTER); - rdtscl(tsc_start); - - do { - local_irq_disable(); - hpet_now = hpet_readl(HPET_COUNTER); - tsc_now = get_cycles_sync(); - local_irq_restore(flags); - } while ((tsc_now - tsc_start) < TICK_COUNT && - (hpet_now - hpet_start) < TICK_COUNT); - - return (tsc_now - tsc_start) * 1000000000L - / ((hpet_now - hpet_start) * hpet_period / 1000); -} - /* * pit_calibrate_tsc() uses the speaker output (channel 2) of @@ -711,138 +177,84 @@ static unsigned int __init pit_calibrate return (end - start) / 50; } -#ifdef CONFIG_HPET -static __init int late_hpet_init(void) -{ - struct hpet_data hd; - unsigned int ntimer; - - if (!vxtime.hpet_address) - return 0; - - memset(&hd, 0, sizeof (hd)); - - ntimer = hpet_readl(HPET_ID); - ntimer = (ntimer & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT; - ntimer++; - - /* - * Register with driver. - * Timer0 and Timer1 is used by platform. - */ - hd.hd_phys_address = vxtime.hpet_address; - hd.hd_address = (void __iomem *)fix_to_virt(FIX_HPET_BASE); - hd.hd_nirqs = ntimer; - hd.hd_flags = HPET_DATA_PLATFORM; - hpet_reserve_timer(&hd, 0); -#ifdef CONFIG_HPET_EMULATE_RTC - hpet_reserve_timer(&hd, 1); -#endif - hd.hd_irq[0] = HPET_LEGACY_8254; - hd.hd_irq[1] = HPET_LEGACY_RTC; - if (ntimer > 2) { - struct hpet *hpet; - struct hpet_timer *timer; - int i; - - hpet = (struct hpet *) fix_to_virt(FIX_HPET_BASE); - timer = &hpet->hpet_timers[2]; - for (i = 2; i < ntimer; timer++, i++) - hd.hd_irq[i] = (timer->hpet_config & - Tn_INT_ROUTE_CNF_MASK) >> - Tn_INT_ROUTE_CNF_SHIFT; +#define PIT_MODE 0x43 +#define PIT_CH0 0x40 - } +static void __init __pit_init(int val, u8 mode) +{ + unsigned long flags; - hpet_alloc(&hd); - return 0; + spin_lock_irqsave(&i8253_lock, flags); + outb_p(mode, PIT_MODE); + outb_p(val & 0xff, PIT_CH0); /* LSB */ + outb_p(val >> 8, PIT_CH0); /* MSB */ + spin_unlock_irqrestore(&i8253_lock, flags); } -fs_initcall(late_hpet_init); -#endif -static int hpet_timer_stop_set_go(unsigned long tick) +static void init_pit_timer(enum clock_event_mode mode, + struct clock_event_device *evt) { - unsigned int cfg; - -/* - * Stop the timers and reset the main counter. - */ + unsigned long flags; - cfg = hpet_readl(HPET_CFG); - cfg &= ~(HPET_CFG_ENABLE | HPET_CFG_LEGACY); - hpet_writel(cfg, HPET_CFG); - hpet_writel(0, HPET_COUNTER); - hpet_writel(0, HPET_COUNTER + 4); + spin_lock_irqsave(&i8253_lock, flags); -/* - * Set up timer 0, as periodic with first interrupt to happen at hpet_tick, - * and period also hpet_tick. - */ - if (hpet_use_timer) { - hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | - HPET_TN_32BIT, HPET_T0_CFG); - hpet_writel(hpet_tick, HPET_T0_CMP); /* next interrupt */ - hpet_writel(hpet_tick, HPET_T0_CMP); /* period */ - cfg |= HPET_CFG_LEGACY; + switch(mode) { + case CLOCK_EVT_PERIODIC: + /* binary, mode 2, LSB/MSB, ch 0 */ + outb_p(0x34, PIT_MODE); + udelay(10); + outb_p(LATCH & 0xff , PIT_CH0); /* LSB */ + outb(LATCH >> 8 , PIT_CH0); /* MSB */ + break; + + case CLOCK_EVT_ONESHOT: + /* One shot setup */ + outb_p(0x38, PIT_MODE); + udelay(10); + break; + case CLOCK_EVT_SHUTDOWN: + outb_p(0x30, PIT_MODE); + outb_p(0, PIT_CH0); /* LSB */ + outb_p(0, PIT_CH0); /* MSB */ + disable_irq(0); + break; } -/* - * Go! - */ - - cfg |= HPET_CFG_ENABLE; - hpet_writel(cfg, HPET_CFG); - - return 0; + spin_unlock_irqrestore(&i8253_lock, flags); } -static int hpet_init(void) +static void pit_next_event(unsigned long delta, struct clock_event_device *evt) { - unsigned int id; - - if (!vxtime.hpet_address) - return -1; - set_fixmap_nocache(FIX_HPET_BASE, vxtime.hpet_address); - __set_fixmap(VSYSCALL_HPET, vxtime.hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE); - -/* - * Read the period, compute tick and quotient. - */ - - id = hpet_readl(HPET_ID); - - if (!(id & HPET_ID_VENDOR) || !(id & HPET_ID_NUMBER)) - return -1; - - hpet_period = hpet_readl(HPET_PERIOD); - if (hpet_period < 100000 || hpet_period > 100000000) - return -1; - - hpet_tick = (FSEC_PER_TICK + hpet_period / 2) / hpet_period; - - hpet_use_timer = (id & HPET_ID_LEGSUP); - - return hpet_timer_stop_set_go(hpet_tick); -} + unsigned long flags; -static int hpet_reenable(void) -{ - return hpet_timer_stop_set_go(hpet_tick); + spin_lock_irqsave(&i8253_lock, flags); + outb_p(delta & 0xff , PIT_CH0); /* LSB */ + outb(delta >> 8 , PIT_CH0); /* MSB */ + spin_unlock_irqrestore(&i8253_lock, flags); } -#define PIT_MODE 0x43 -#define PIT_CH0 0x40 +struct clock_event_device pit_clockevent = { + .name = "pit", + .capabilities = CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | CLOCK_CAP_UPDATE +#ifndef CONFIG_SMP + | CLOCK_CAP_NEXTEVT +#endif + , + .set_mode = init_pit_timer, + .set_next_event = pit_next_event, + .shift = 32, +}; -static void __init __pit_init(int val, u8 mode) +void setup_pit_timer(void) { - unsigned long flags; - - spin_lock_irqsave(&i8253_lock, flags); - outb_p(mode, PIT_MODE); - outb_p(val & 0xff, PIT_CH0); /* LSB */ - outb_p(val >> 8, PIT_CH0); /* MSB */ - spin_unlock_irqrestore(&i8253_lock, flags); + pit_clockevent.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, 32); + pit_clockevent.max_delta_ns = + clockevent_delta2ns(0x7FFF, &pit_clockevent); + pit_clockevent.min_delta_ns = + clockevent_delta2ns(0xF, &pit_clockevent); + register_global_clockevent(&pit_clockevent); } + void __init pit_init(void) { __pit_init(LATCH, 0x34); /* binary, mode 2, LSB/MSB, ch 0 */ @@ -856,9 +268,9 @@ void __init pit_stop_interrupt(void) void __init stop_timer_interrupt(void) { char *name; - if (vxtime.hpet_address) { + if (hpet_address) { name = "HPET"; - hpet_timer_stop_set_go(0); + hpet_stop(); } else { name = "PIT"; pit_stop_interrupt(); @@ -873,127 +285,40 @@ int __init time_setup(char *str) } static struct irqaction irq0 = { - timer_interrupt, IRQF_DISABLED, CPU_MASK_NONE, "timer", NULL, NULL + timer_interrupt, IRQF_DISABLED | IRQF_NODELAY, CPU_MASK_NONE, "timer", NULL, NULL }; void __init time_init(void) { if (nohpet) - vxtime.hpet_address = 0; - + hpet_address = 0; xtime.tv_sec = get_cmos_time(); xtime.tv_nsec = 0; set_normalized_timespec(&wall_to_monotonic, -xtime.tv_sec, -xtime.tv_nsec); - if (!hpet_init()) - vxtime_hz = (FSEC_PER_SEC + hpet_period / 2) / hpet_period; - else - vxtime.hpet_address = 0; + if (hpet_arch_init()) + hpet_address = 0; + + setup_pit_timer(); if (hpet_use_timer) { /* set tick_nsec to use the proper rate for HPET */ tick_nsec = TICK_NSEC_HPET; cpu_khz = hpet_calibrate_tsc(); timename = "HPET"; -#ifdef CONFIG_X86_PM_TIMER - } else if (pmtmr_ioport && !vxtime.hpet_address) { - vxtime_hz = PM_TIMER_FREQUENCY; - timename = "PM"; - pit_init(); - cpu_khz = pit_calibrate_tsc(); -#endif } else { pit_init(); cpu_khz = pit_calibrate_tsc(); timename = "PIT"; } - vxtime.mode = VXTIME_TSC; - vxtime.quot = (USEC_PER_SEC << US_SCALE) / vxtime_hz; - vxtime.tsc_quot = (USEC_PER_MSEC << US_SCALE) / cpu_khz; - vxtime.last_tsc = get_cycles_sync(); - set_cyc2ns_scale(cpu_khz); - setup_irq(0, &irq0); - -#ifndef CONFIG_SMP - time_init_gtod(); -#endif -} - -/* - * Make an educated guess if the TSC is trustworthy and synchronized - * over all CPUs. - */ -__cpuinit int unsynchronized_tsc(void) -{ -#ifdef CONFIG_SMP - if (apic_is_clustered_box()) - return 1; -#endif - /* Most intel systems have synchronized TSCs except for - multi node systems */ - if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) { -#ifdef CONFIG_ACPI - /* But TSC doesn't tick in C3 so don't use it there */ - if (acpi_fadt.length > 0 && acpi_fadt.plvl3_lat < 1000) - return 1; -#endif - return 0; - } - - /* Assume multi socket systems are not synchronized */ - return num_present_cpus() > 1; -} - -/* - * Decide what mode gettimeofday should use. - */ -void time_init_gtod(void) -{ - char *timetype; - if (unsynchronized_tsc()) - notsc = 1; - - if (cpu_has(&boot_cpu_data, X86_FEATURE_RDTSCP)) - vgetcpu_mode = VGETCPU_RDTSCP; - else - vgetcpu_mode = VGETCPU_LSL; - - if (vxtime.hpet_address && notsc) { - timetype = hpet_use_timer ? "HPET" : "PIT/HPET"; - if (hpet_use_timer) - vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick; - else - vxtime.last = hpet_readl(HPET_COUNTER); - vxtime.mode = VXTIME_HPET; - do_gettimeoffset = do_gettimeoffset_hpet; -#ifdef CONFIG_X86_PM_TIMER - /* Using PM for gettimeofday is quite slow, but we have no other - choice because the TSC is too unreliable on some systems. */ - } else if (pmtmr_ioport && !vxtime.hpet_address && notsc) { - timetype = "PM"; - do_gettimeoffset = do_gettimeoffset_pm; - vxtime.mode = VXTIME_PMTMR; - sysctl_vsyscall = 0; - printk(KERN_INFO "Disabling vsyscall due to use of PM timer\n"); -#endif - } else { - timetype = hpet_use_timer ? "HPET/TSC" : "PIT/TSC"; - vxtime.mode = VXTIME_TSC; - } - - printk(KERN_INFO "time.c: Using %ld.%06ld MHz WALL %s GTOD %s timer.\n", - vxtime_hz / 1000000, vxtime_hz % 1000000, timename, timetype); - printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n", - cpu_khz / 1000, cpu_khz % 1000); - vxtime.quot = (USEC_PER_SEC << US_SCALE) / vxtime_hz; - vxtime.tsc_quot = (USEC_PER_MSEC << US_SCALE) / cpu_khz; - vxtime.last_tsc = get_cycles_sync(); + mark_tsc_unstable(); set_cyc2ns_scale(cpu_khz); + setup_irq(0, &irq0); } __setup("report_lost_ticks", time_setup); @@ -1020,42 +345,11 @@ static int timer_suspend(struct sys_devi static int timer_resume(struct sys_device *dev) { - unsigned long flags; - unsigned long sec; - unsigned long ctime = get_cmos_time(); - long sleep_length = (ctime - sleep_start) * HZ; - - if (sleep_length < 0) { - printk(KERN_WARNING "Time skew detected in timer resume!\n"); - /* The time after the resume must not be earlier than the time - * before the suspend or some nasty things will happen - */ - sleep_length = 0; - ctime = sleep_start; - } - if (vxtime.hpet_address) + if (hpet_address) hpet_reenable(); else i8254_timer_resume(); - sec = ctime + clock_cmos_diff; - write_seqlock_irqsave(&xtime_lock,flags); - xtime.tv_sec = sec; - xtime.tv_nsec = 0; - if (vxtime.mode == VXTIME_HPET) { - if (hpet_use_timer) - vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick; - else - vxtime.last = hpet_readl(HPET_COUNTER); -#ifdef CONFIG_X86_PM_TIMER - } else if (vxtime.mode == VXTIME_PMTMR) { - pmtimer_resume(); -#endif - } else - vxtime.last_tsc = get_cycles_sync(); - write_sequnlock_irqrestore(&xtime_lock,flags); - jiffies += sleep_length; - monotonic_base += sleep_length * (NSEC_PER_SEC/HZ); touch_softlockup_watchdog(); return 0; } @@ -1081,269 +375,3 @@ static int time_init_device(void) } device_initcall(time_init_device); - -#ifdef CONFIG_HPET_EMULATE_RTC -/* HPET in LegacyReplacement Mode eats up RTC interrupt line. When, HPET - * is enabled, we support RTC interrupt functionality in software. - * RTC has 3 kinds of interrupts: - * 1) Update Interrupt - generate an interrupt, every sec, when RTC clock - * is updated - * 2) Alarm Interrupt - generate an interrupt at a specific time of day - * 3) Periodic Interrupt - generate periodic interrupt, with frequencies - * 2Hz-8192Hz (2Hz-64Hz for non-root user) (all freqs in powers of 2) - * (1) and (2) above are implemented using polling at a frequency of - * 64 Hz. The exact frequency is a tradeoff between accuracy and interrupt - * overhead. (DEFAULT_RTC_INT_FREQ) - * For (3), we use interrupts at 64Hz or user specified periodic - * frequency, whichever is higher. - */ -#include - -#define DEFAULT_RTC_INT_FREQ 64 -#define RTC_NUM_INTS 1 - -static unsigned long UIE_on; -static unsigned long prev_update_sec; - -static unsigned long AIE_on; -static struct rtc_time alarm_time; - -static unsigned long PIE_on; -static unsigned long PIE_freq = DEFAULT_RTC_INT_FREQ; -static unsigned long PIE_count; - -static unsigned long hpet_rtc_int_freq; /* RTC interrupt frequency */ -static unsigned int hpet_t1_cmp; /* cached comparator register */ - -int is_hpet_enabled(void) -{ - return vxtime.hpet_address != 0; -} - -/* - * Timer 1 for RTC, we do not use periodic interrupt feature, - * even if HPET supports periodic interrupts on Timer 1. - * The reason being, to set up a periodic interrupt in HPET, we need to - * stop the main counter. And if we do that everytime someone diables/enables - * RTC, we will have adverse effect on main kernel timer running on Timer 0. - * So, for the time being, simulate the periodic interrupt in software. - * - * hpet_rtc_timer_init() is called for the first time and during subsequent - * interuppts reinit happens through hpet_rtc_timer_reinit(). - */ -int hpet_rtc_timer_init(void) -{ - unsigned int cfg, cnt; - unsigned long flags; - - if (!is_hpet_enabled()) - return 0; - /* - * Set the counter 1 and enable the interrupts. - */ - if (PIE_on && (PIE_freq > DEFAULT_RTC_INT_FREQ)) - hpet_rtc_int_freq = PIE_freq; - else - hpet_rtc_int_freq = DEFAULT_RTC_INT_FREQ; - - local_irq_save(flags); - - cnt = hpet_readl(HPET_COUNTER); - cnt += ((hpet_tick*HZ)/hpet_rtc_int_freq); - hpet_writel(cnt, HPET_T1_CMP); - hpet_t1_cmp = cnt; - - cfg = hpet_readl(HPET_T1_CFG); - cfg &= ~HPET_TN_PERIODIC; - cfg |= HPET_TN_ENABLE | HPET_TN_32BIT; - hpet_writel(cfg, HPET_T1_CFG); - - local_irq_restore(flags); - - return 1; -} - -static void hpet_rtc_timer_reinit(void) -{ - unsigned int cfg, cnt, ticks_per_int, lost_ints; - - if (unlikely(!(PIE_on | AIE_on | UIE_on))) { - cfg = hpet_readl(HPET_T1_CFG); - cfg &= ~HPET_TN_ENABLE; - hpet_writel(cfg, HPET_T1_CFG); - return; - } - - if (PIE_on && (PIE_freq > DEFAULT_RTC_INT_FREQ)) - hpet_rtc_int_freq = PIE_freq; - else - hpet_rtc_int_freq = DEFAULT_RTC_INT_FREQ; - - /* It is more accurate to use the comparator value than current count.*/ - ticks_per_int = hpet_tick * HZ / hpet_rtc_int_freq; - hpet_t1_cmp += ticks_per_int; - hpet_writel(hpet_t1_cmp, HPET_T1_CMP); - - /* - * If the interrupt handler was delayed too long, the write above tries - * to schedule the next interrupt in the past and the hardware would - * not interrupt until the counter had wrapped around. - * So we have to check that the comparator wasn't set to a past time. - */ - cnt = hpet_readl(HPET_COUNTER); - if (unlikely((int)(cnt - hpet_t1_cmp) > 0)) { - lost_ints = (cnt - hpet_t1_cmp) / ticks_per_int + 1; - /* Make sure that, even with the time needed to execute - * this code, the next scheduled interrupt has been moved - * back to the future: */ - lost_ints++; - - hpet_t1_cmp += lost_ints * ticks_per_int; - hpet_writel(hpet_t1_cmp, HPET_T1_CMP); - - if (PIE_on) - PIE_count += lost_ints; - - printk(KERN_WARNING "rtc: lost some interrupts at %ldHz.\n", - hpet_rtc_int_freq); - } -} - -/* - * The functions below are called from rtc driver. - * Return 0 if HPET is not being used. - * Otherwise do the necessary changes and return 1. - */ -int hpet_mask_rtc_irq_bit(unsigned long bit_mask) -{ - if (!is_hpet_enabled()) - return 0; - - if (bit_mask & RTC_UIE) - UIE_on = 0; - if (bit_mask & RTC_PIE) - PIE_on = 0; - if (bit_mask & RTC_AIE) - AIE_on = 0; - - return 1; -} - -int hpet_set_rtc_irq_bit(unsigned long bit_mask) -{ - int timer_init_reqd = 0; - - if (!is_hpet_enabled()) - return 0; - - if (!(PIE_on | AIE_on | UIE_on)) - timer_init_reqd = 1; - - if (bit_mask & RTC_UIE) { - UIE_on = 1; - } - if (bit_mask & RTC_PIE) { - PIE_on = 1; - PIE_count = 0; - } - if (bit_mask & RTC_AIE) { - AIE_on = 1; - } - - if (timer_init_reqd) - hpet_rtc_timer_init(); - - return 1; -} - -int hpet_set_alarm_time(unsigned char hrs, unsigned char min, unsigned char sec) -{ - if (!is_hpet_enabled()) - return 0; - - alarm_time.tm_hour = hrs; - alarm_time.tm_min = min; - alarm_time.tm_sec = sec; - - return 1; -} - -int hpet_set_periodic_freq(unsigned long freq) -{ - if (!is_hpet_enabled()) - return 0; - - PIE_freq = freq; - PIE_count = 0; - - return 1; -} - -int hpet_rtc_dropped_irq(void) -{ - if (!is_hpet_enabled()) - return 0; - - return 1; -} - -irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs) -{ - struct rtc_time curr_time; - unsigned long rtc_int_flag = 0; - int call_rtc_interrupt = 0; - - hpet_rtc_timer_reinit(); - - if (UIE_on | AIE_on) { - rtc_get_rtc_time(&curr_time); - } - if (UIE_on) { - if (curr_time.tm_sec != prev_update_sec) { - /* Set update int info, call real rtc int routine */ - call_rtc_interrupt = 1; - rtc_int_flag = RTC_UF; - prev_update_sec = curr_time.tm_sec; - } - } - if (PIE_on) { - PIE_count++; - if (PIE_count >= hpet_rtc_int_freq/PIE_freq) { - /* Set periodic int info, call real rtc int routine */ - call_rtc_interrupt = 1; - rtc_int_flag |= RTC_PF; - PIE_count = 0; - } - } - if (AIE_on) { - if ((curr_time.tm_sec == alarm_time.tm_sec) && - (curr_time.tm_min == alarm_time.tm_min) && - (curr_time.tm_hour == alarm_time.tm_hour)) { - /* Set alarm int info, call real rtc int routine */ - call_rtc_interrupt = 1; - rtc_int_flag |= RTC_AF; - } - } - if (call_rtc_interrupt) { - rtc_int_flag |= (RTC_IRQF | (RTC_NUM_INTS << 8)); - rtc_interrupt(rtc_int_flag, dev_id); - } - return IRQ_HANDLED; -} -#endif - -static int __init nohpet_setup(char *s) -{ - nohpet = 1; - return 1; -} - -__setup("nohpet", nohpet_setup); - -int __init notsc_setup(char *s) -{ - notsc = 1; - return 1; -} - -__setup("notsc", notsc_setup); Index: linux/arch/x86_64/kernel/traps.c =================================================================== --- linux.orig/arch/x86_64/kernel/traps.c +++ linux/arch/x86_64/kernel/traps.c @@ -70,19 +70,19 @@ asmlinkage void alignment_check(void); asmlinkage void machine_check(void); asmlinkage void spurious_interrupt_bug(void); -ATOMIC_NOTIFIER_HEAD(die_chain); +RAW_NOTIFIER_HEAD(die_chain); EXPORT_SYMBOL(die_chain); int register_die_notifier(struct notifier_block *nb) { vmalloc_sync_all(); - return atomic_notifier_chain_register(&die_chain, nb); + return raw_notifier_chain_register(&die_chain, nb); } EXPORT_SYMBOL(register_die_notifier); /* used modular by kdb */ int unregister_die_notifier(struct notifier_block *nb) { - return atomic_notifier_chain_unregister(&die_chain, nb); + return raw_notifier_chain_unregister(&die_chain, nb); } EXPORT_SYMBOL(unregister_die_notifier); /* used modular by kdb */ @@ -251,7 +251,7 @@ static inline int valid_stack_ptr(struct void dump_trace(struct task_struct *tsk, struct pt_regs *regs, unsigned long * stack, struct stacktrace_ops *ops, void *data) { - const unsigned cpu = smp_processor_id(); + const unsigned cpu = raw_smp_processor_id(); unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr; unsigned used = 0; struct thread_info *tinfo; @@ -419,6 +419,8 @@ show_trace(struct task_struct *tsk, stru printk("\nCall Trace:\n"); dump_trace(tsk, regs, stack, &print_trace_ops, NULL); printk("\n"); + debug_show_held_locks(tsk); + print_traces(tsk); } static void @@ -426,7 +428,7 @@ _show_stack(struct task_struct *tsk, str { unsigned long *stack; int i; - const int cpu = smp_processor_id(); + const int cpu = raw_smp_processor_id(); unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr); unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE); @@ -549,19 +551,20 @@ void out_of_line_bug(void) EXPORT_SYMBOL(out_of_line_bug); #endif -static DEFINE_SPINLOCK(die_lock); +static DEFINE_RAW_SPINLOCK(die_lock); static int die_owner = -1; static unsigned int die_nest_count; unsigned __kprobes long oops_begin(void) { - int cpu = smp_processor_id(); unsigned long flags; + int cpu; oops_enter(); /* racy, but better than risking deadlock. */ local_irq_save(flags); + cpu = smp_processor_id(); if (!spin_trylock(&die_lock)) { if (cpu == die_owner) /* nested oops. should stop eventually */; Index: linux/arch/x86_64/kernel/tsc.c =================================================================== --- /dev/null +++ linux/arch/x86_64/kernel/tsc.c @@ -0,0 +1,233 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define NS_SCALE 10 /* 2^10, carefully chosen */ +#define US_SCALE 32 /* 2^32, arbitralrily chosen */ + +static int notsc __initdata = 0; + +unsigned int cpu_khz; /* TSC clocks / usec, not used here */ +EXPORT_SYMBOL(cpu_khz); + +static unsigned int cyc2ns_scale __read_mostly; + +void set_cyc2ns_scale(unsigned long khz) +{ + cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz; +} + +static inline unsigned long long cycles_2_ns(unsigned long long cyc) +{ + return (cyc * cyc2ns_scale) >> NS_SCALE; +} + +unsigned long long sched_clock(void) +{ + unsigned long a = 0; + + /* Could do CPU core sync here. Opteron can execute rdtsc speculatively, + which means it is not completely exact and may not be monotonous between + CPUs. But the errors should be too small to matter for scheduling + purposes. */ + + rdtscll(a); + return cycles_2_ns(a); +} + +static int tsc_unstable; + +static inline int check_tsc_unstable(void) +{ + return tsc_unstable; +} + +void mark_tsc_unstable(void) +{ + tsc_unstable = 1; +} +EXPORT_SYMBOL_GPL(mark_tsc_unstable); + +#ifdef CONFIG_CPU_FREQ + +/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency + changes. + + RED-PEN: On SMP we assume all CPUs run with the same frequency. It's + not that important because current Opteron setups do not support + scaling on SMP anyroads. + + Should fix up last_tsc too. Currently gettimeofday in the + first tick after the change will be slightly wrong. */ + +#include + +static unsigned int cpufreq_delayed_issched = 0; +static unsigned int cpufreq_init = 0; +static struct work_struct cpufreq_delayed_get_work; + +static void handle_cpufreq_delayed_get(void *v) +{ + unsigned int cpu; + for_each_online_cpu(cpu) { + cpufreq_get(cpu); + } + cpufreq_delayed_issched = 0; +} + +static unsigned int ref_freq = 0; +static unsigned long loops_per_jiffy_ref = 0; + +static unsigned long cpu_khz_ref = 0; + +static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val, + void *data) +{ + struct cpufreq_freqs *freq = data; + unsigned long *lpj, dummy; + + if (cpu_has(&cpu_data[freq->cpu], X86_FEATURE_CONSTANT_TSC)) + return 0; + + lpj = &dummy; + if (!(freq->flags & CPUFREQ_CONST_LOOPS)) +#ifdef CONFIG_SMP + lpj = &cpu_data[freq->cpu].loops_per_jiffy; +#else + lpj = &boot_cpu_data.loops_per_jiffy; +#endif + + if (!ref_freq) { + ref_freq = freq->old; + loops_per_jiffy_ref = *lpj; + cpu_khz_ref = cpu_khz; + } + if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) || + (val == CPUFREQ_POSTCHANGE && freq->old > freq->new) || + (val == CPUFREQ_RESUMECHANGE)) { + *lpj = + cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new); + + cpu_khz = cpufreq_scale(cpu_khz_ref, ref_freq, freq->new); + if (!(freq->flags & CPUFREQ_CONST_LOOPS)) + mark_tsc_unstable(); + } + + set_cyc2ns_scale(cpu_khz_ref); + + return 0; +} + +static struct notifier_block time_cpufreq_notifier_block = { + .notifier_call = time_cpufreq_notifier +}; + +static int __init cpufreq_tsc(void) +{ + INIT_WORK(&cpufreq_delayed_get_work, handle_cpufreq_delayed_get, NULL); + if (!cpufreq_register_notifier(&time_cpufreq_notifier_block, + CPUFREQ_TRANSITION_NOTIFIER)) + cpufreq_init = 1; + return 0; +} + +/* + * init_cpufreq_transition_notifier_list() should execute first, + * which is a core_initcall, so mark this one core_initcall_sync: + */ +core_initcall_sync(cpufreq_tsc); + +#endif +/* + * Make an educated guess if the TSC is trustworthy and synchronized + * over all CPUs. + */ +__cpuinit int unsynchronized_tsc(void) +{ +#ifdef CONFIG_SMP + if (apic_is_clustered_box()) + return 1; +#endif + /* Most intel systems have synchronized TSCs except for + multi node systems */ + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) { +#ifdef CONFIG_ACPI + /* But TSC doesn't tick in C3 so don't use it there */ + if (acpi_fadt.length > 0 && acpi_fadt.plvl3_lat < 100) + return 1; +#endif + return 0; + } + + /* Assume multi socket systems are not synchronized */ + return num_present_cpus() > 1; +} + +int __init notsc_setup(char *s) +{ + notsc = 1; + return 1; +} + +__setup("notsc", notsc_setup); + + +/* clock source code: */ + +static int tsc_update_callback(void); + +static notrace cycle_t read_tsc(void) +{ + cycle_t ret = (cycle_t)get_cycles_sync(); + return ret; +} + +static cycle_t __vsyscall_fn vread_tsc(void) +{ + cycle_t ret = (cycle_t)get_cycles_sync(); + return ret; +} + +static struct clocksource clocksource_tsc = { + .name = "tsc", + .rating = 300, + .read = read_tsc, + .mask = (cycle_t)-1, + .mult = 0, /* to be set */ + .shift = 22, + .update_callback = tsc_update_callback, + .is_continuous = 1, + .vread = vread_tsc, +}; + +static int tsc_update_callback(void) +{ + int change = 0; + + /* check to see if we should switch to the safe clocksource: */ + if (clocksource_tsc.rating != 50 && check_tsc_unstable()) { + clocksource_tsc.rating = 50; + clocksource_reselect(); + change = 1; + } + return change; +} + +static int __init init_tsc_clocksource(void) +{ + if (!notsc) { + clocksource_tsc.mult = clocksource_khz2mult(cpu_khz, + clocksource_tsc.shift); + return clocksource_register(&clocksource_tsc); + } + return 0; +} + +module_init(init_tsc_clocksource); Index: linux/arch/x86_64/kernel/tsc_sync.c =================================================================== --- /dev/null +++ linux/arch/x86_64/kernel/tsc_sync.c @@ -0,0 +1,187 @@ +/* + * arch/x86_64/kernel/tsc_sync.c: check TSC synchronization. + * + * Copyright (C) 2006, Red Hat, Inc., Ingo Molnar + * + * We check whether all boot CPUs have their TSC's synchronized, + * print a warning if not and turn off the TSC clock-source. + * + * The warp-check is point-to-point between two CPUs, the CPU + * initiating the bootup is the 'source CPU', the freshly booting + * CPU is the 'target CPU'. + * + * Only two CPUs may participate - they can enter in any order. + * ( The serial nature of the boot logic and the CPU hotplug lock + * protects against more than 2 CPUs entering this code. ) + */ +#include +#include +#include +#include +#include +#include + +/* + * Entry/exit counters that make sure that both CPUs + * run the measurement code at once: + */ +static __cpuinitdata atomic_t start_count; +static __cpuinitdata atomic_t stop_count; + +/* + * We use a raw spinlock in this exceptional case, because + * we want to have the fastest, inlined, non-debug version + * of a critical section, to be able to prove TSC time-warps: + */ +static __cpuinitdata __raw_spinlock_t sync_lock = __RAW_SPIN_LOCK_UNLOCKED; +static __cpuinitdata cycles_t last_tsc; +static __cpuinitdata cycles_t max_warp; +static __cpuinitdata int nr_warps; + +/* + * TSC-warp measurement loop running on both CPUs: + */ +static __cpuinit void check_tsc_warp(void) +{ + cycles_t start, now, prev, end; + int i; + + start = get_cycles_sync(); + /* + * The measurement runs for 20 msecs: + */ + end = start + cpu_khz * 20ULL; + now = start; + + for (i = 0; ; i++) { + /* + * We take the global lock, measure TSC, save the + * previous TSC that was measured (possibly on + * another CPU) and update the previous TSC timestamp. + */ + __raw_spin_lock(&sync_lock); + prev = last_tsc; + now = get_cycles_sync(); + last_tsc = now; + __raw_spin_unlock(&sync_lock); + + /* + * Be nice every now and then (and also check whether + * measurement is done [we also insert a 100 million + * loops safety exit, so we dont lock up in case the + * TSC readout is totally broken]): + */ + if (unlikely(!(i & 7))) { + if (now > end || i > 100000000) + break; + cpu_relax(); + touch_nmi_watchdog(); + } + /* + * Outside the critical section we can now see whether + * we saw a time-warp of the TSC going backwards: + */ + if (unlikely(prev > now)) { + __raw_spin_lock(&sync_lock); + max_warp = max(max_warp, prev - now); + nr_warps++; + __raw_spin_unlock(&sync_lock); + } + + } +} + +/* + * Source CPU calls into this - it waits for the freshly booted + * target CPU to arrive and then starts the measurement: + */ +void __cpuinit check_tsc_sync_source(int cpu) +{ + int cpus = 2; + + /* + * No need to check if we already know that the TSC is not + * synchronized: + */ + if (unsynchronized_tsc()) + return; + + printk(KERN_INFO "checking TSC synchronization [CPU#%d -> CPU#%d]:", + smp_processor_id(), cpu); + + /* + * Reset it - in case this is a second bootup: + */ + atomic_set(&stop_count, 0); + + /* + * Wait for the target to arrive: + */ + while (atomic_read(&start_count) != cpus-1) + cpu_relax(); + /* + * Trigger the target to continue into the measurement too: + */ + atomic_inc(&start_count); + + check_tsc_warp(); + + while (atomic_read(&stop_count) != cpus-1) + cpu_relax(); + + /* + * Reset it - just in case we boot another CPU later: + */ + atomic_set(&start_count, 0); + + if (nr_warps) { + printk("\n"); + printk(KERN_WARNING "Measured %Ld cycles TSC warp between CPUs," + " turning off TSC clock.\n", max_warp); + mark_tsc_unstable(); + nr_warps = 0; + max_warp = 0; + last_tsc = 0; + } else { + printk(" passed.\n"); + } + + /* + * Let the target continue with the bootup: + */ + atomic_inc(&stop_count); +} + +/* + * Freshly booted CPUs call into this: + */ +void __cpuinit check_tsc_sync_target(void) +{ + int cpus = 2; + + if (unsynchronized_tsc()) + return; + + /* + * Register this CPU's participation and wait for the + * source CPU to start the measurement: + */ + atomic_inc(&start_count); + while (atomic_read(&start_count) != cpus) + cpu_relax(); + + check_tsc_warp(); + + /* + * Ok, we are done: + */ + atomic_inc(&stop_count); + + /* + * Wait for the source CPU to print stuff: + */ + while (atomic_read(&stop_count) != cpus) + cpu_relax(); +} +#undef NR_LOOPS + Index: linux/arch/x86_64/kernel/vmlinux.lds.S =================================================================== --- linux.orig/arch/x86_64/kernel/vmlinux.lds.S +++ linux/arch/x86_64/kernel/vmlinux.lds.S @@ -94,27 +94,16 @@ SECTIONS __vsyscall_0 = VSYSCALL_VIRT_ADDR; . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) } - xtime_lock = VVIRT(.xtime_lock); - - .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) } - vxtime = VVIRT(.vxtime); + .vsyscall_fn : AT(VLOAD(.vsyscall_fn)) { *(.vsyscall_fn) } + . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); .vgetcpu_mode : AT(VLOAD(.vgetcpu_mode)) { *(.vgetcpu_mode) } vgetcpu_mode = VVIRT(.vgetcpu_mode); - .sys_tz : AT(VLOAD(.sys_tz)) { *(.sys_tz) } - sys_tz = VVIRT(.sys_tz); - - .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) } - sysctl_vsyscall = VVIRT(.sysctl_vsyscall); - - .xtime : AT(VLOAD(.xtime)) { *(.xtime) } - xtime = VVIRT(.xtime); - . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) } - jiffies = VVIRT(.jiffies); + .vsyscall_gtod_data : AT(VLOAD(.vsyscall_gtod_data)) { *(.vsyscall_gtod_data) } + vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data); + .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) } .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) } Index: linux/arch/x86_64/kernel/vsyscall.c =================================================================== --- linux.orig/arch/x86_64/kernel/vsyscall.c +++ linux/arch/x86_64/kernel/vsyscall.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -36,65 +37,56 @@ #include #include #include -#include +#include #include #include #include -#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) - -int __sysctl_vsyscall __section_sysctl_vsyscall = 1; -seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED; -int __vgetcpu_mode __section_vgetcpu_mode; - -#include +#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) notrace -static __always_inline void timeval_normalize(struct timeval * tv) -{ - time_t __sec; +struct vsyscall_gtod_data_t { + raw_seqlock_t lock; + int sysctl_enabled; + struct timeval wall_time_tv; + struct timezone sys_tz; + cycle_t offset_base; + struct clocksource clock; + unsigned long jiffies; + int vgetcpu_mode; +}; - __sec = tv->tv_usec / 1000000; - if (__sec) { - tv->tv_usec %= 1000000; - tv->tv_sec += __sec; - } -} +struct vsyscall_gtod_data_t __vsyscall_gtod_data __section_vsyscall_gtod_data = { + .lock = __RAW_SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock), + .sysctl_enabled = 1, + .jiffies = INITIAL_JIFFIES, +}; -static __always_inline void do_vgettimeofday(struct timeval * tv) +void update_vsyscall(struct timespec* wall_time, struct clocksource* clock) { - long sequence, t; - unsigned long sec, usec; + unsigned long flags; - do { - sequence = read_seqbegin(&__xtime_lock); - - sec = __xtime.tv_sec; - usec = __xtime.tv_nsec / 1000; - - if (__vxtime.mode != VXTIME_HPET) { - t = get_cycles_sync(); - if (t < __vxtime.last_tsc) - t = __vxtime.last_tsc; - usec += ((t - __vxtime.last_tsc) * - __vxtime.tsc_quot) >> 32; - /* See comment in x86_64 do_gettimeofday. */ - } else { - usec += ((readl((void __iomem *) - fix_to_virt(VSYSCALL_HPET) + 0xf0) - - __vxtime.last) * __vxtime.quot) >> 32; - } - } while (read_seqretry(&__xtime_lock, sequence)); + write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags); + /* copy vsyscall data */ + vsyscall_gtod_data.clock = *clock; + vsyscall_gtod_data.wall_time_tv.tv_sec = wall_time->tv_sec; + vsyscall_gtod_data.wall_time_tv.tv_usec = wall_time->tv_nsec/1000; + vsyscall_gtod_data.sys_tz = sys_tz; + + if (cpu_has(&boot_cpu_data, X86_FEATURE_RDTSCP)) + vsyscall_gtod_data.vgetcpu_mode = VGETCPU_RDTSCP; + else + vsyscall_gtod_data.vgetcpu_mode = VGETCPU_LSL; - tv->tv_sec = sec + usec / 1000000; - tv->tv_usec = usec % 1000000; + write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags); } /* RED-PEN may want to readd seq locking, but then the variable should be write-once. */ static __always_inline void do_get_tz(struct timezone * tz) { - *tz = __sys_tz; + *tz = __vsyscall_gtod_data.sys_tz; } + static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz) { int ret; @@ -113,10 +105,44 @@ static __always_inline long time_syscall return secs; } +static __always_inline void do_vgettimeofday(struct timeval * tv) +{ + cycle_t now, base, mask, cycle_delta; + unsigned long seq, mult, shift, nsec_delta; + cycle_t (*vread)(void); + do { + seq = read_seqbegin(&__vsyscall_gtod_data.lock); + + vread = __vsyscall_gtod_data.clock.vread; + if (unlikely(!__vsyscall_gtod_data.sysctl_enabled || !vread)) { + gettimeofday(tv,0); + return; + } + now = vread(); + base = __vsyscall_gtod_data.clock.cycle_last; + mask = __vsyscall_gtod_data.clock.mask; + mult = __vsyscall_gtod_data.clock.mult; + shift = __vsyscall_gtod_data.clock.shift; + + *tv = __vsyscall_gtod_data.wall_time_tv; + + } while (read_seqretry(&__vsyscall_gtod_data.lock, seq)); + + /* calculate interval: */ + cycle_delta = (now - base) & mask; + /* convert to nsecs: */ + nsec_delta = (cycle_delta * mult) >> shift; + + /* convert to usecs and add to timespec: */ + tv->tv_usec += nsec_delta / NSEC_PER_USEC; + while (tv->tv_usec > USEC_PER_SEC) { + tv->tv_sec += 1; + tv->tv_usec -= USEC_PER_SEC; + } +} + int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz) { - if (!__sysctl_vsyscall) - return gettimeofday(tv,tz); if (tv) do_vgettimeofday(tv); if (tz) @@ -128,11 +154,11 @@ int __vsyscall(0) vgettimeofday(struct t * unlikely */ time_t __vsyscall(1) vtime(time_t *t) { - if (!__sysctl_vsyscall) + if (unlikely(!__vsyscall_gtod_data.sysctl_enabled)) return time_syscall(t); else if (t) - *t = __xtime.tv_sec; - return __xtime.tv_sec; + *t = __vsyscall_gtod_data.wall_time_tv.tv_sec; + return __vsyscall_gtod_data.wall_time_tv.tv_sec; } /* Fast way to get current CPU and node. @@ -157,9 +183,9 @@ vgetcpu(unsigned *cpu, unsigned *node, s We do this here because otherwise user space would do it on its own in a likely inferior way (no access to jiffies). If you don't like it pass NULL. */ - if (tcache && tcache->blob[0] == (j = __jiffies)) { + if (tcache && tcache->blob[0] == (j = __vsyscall_gtod_data.jiffies)) { p = tcache->blob[1]; - } else if (__vgetcpu_mode == VGETCPU_RDTSCP) { + } else if (__vsyscall_gtod_data.vgetcpu_mode == VGETCPU_RDTSCP) { /* Load per CPU data from RDTSCP */ rdtscp(dummy, dummy, p); } else { @@ -209,7 +235,7 @@ static int vsyscall_sysctl_change(ctl_ta ret = -ENOMEM; goto out; } - if (!sysctl_vsyscall) { + if (!vsyscall_gtod_data.sysctl_enabled) { writew(SYSCALL, map1); writew(SYSCALL, map2); } else { @@ -232,7 +258,7 @@ static int vsyscall_sysctl_nostrat(ctl_t static ctl_table kernel_table2[] = { { .ctl_name = 99, .procname = "vsyscall64", - .data = &sysctl_vsyscall, .maxlen = sizeof(int), .mode = 0644, + .data = &vsyscall_gtod_data.sysctl_enabled, .maxlen = sizeof(int), .mode = 0644, .strategy = vsyscall_sysctl_nostrat, .proc_handler = vsyscall_sysctl_change }, { 0, } @@ -274,7 +300,6 @@ static void __cpuinit cpu_vsyscall_init( vsyscall_set_cpu(raw_smp_processor_id()); } -#ifdef CONFIG_HOTPLUG_CPU static int __cpuinit cpu_vsyscall_notifier(struct notifier_block *n, unsigned long action, void *arg) { @@ -283,7 +308,6 @@ cpu_vsyscall_notifier(struct notifier_bl smp_call_function_single(cpu, cpu_vsyscall_init, NULL, 0, 1); return NOTIFY_DONE; } -#endif static void __init map_vsyscall(void) { Index: linux/arch/x86_64/kernel/x8664_ksyms.c =================================================================== --- linux.orig/arch/x86_64/kernel/x8664_ksyms.c +++ linux/arch/x86_64/kernel/x8664_ksyms.c @@ -11,10 +11,12 @@ EXPORT_SYMBOL(kernel_thread); -EXPORT_SYMBOL(__down_failed); -EXPORT_SYMBOL(__down_failed_interruptible); -EXPORT_SYMBOL(__down_failed_trylock); -EXPORT_SYMBOL(__up_wakeup); +#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK +EXPORT_SYMBOL(__compat_down_failed); +EXPORT_SYMBOL(__compat_down_failed_interruptible); +EXPORT_SYMBOL(__compat_down_failed_trylock); +EXPORT_SYMBOL(__compat_up_wakeup); +#endif EXPORT_SYMBOL(__get_user_1); EXPORT_SYMBOL(__get_user_2); Index: linux/arch/x86_64/lib/thunk.S =================================================================== --- linux.orig/arch/x86_64/lib/thunk.S +++ linux/arch/x86_64/lib/thunk.S @@ -40,11 +40,13 @@ thunk rwsem_wake_thunk,rwsem_wake thunk rwsem_downgrade_thunk,rwsem_downgrade_wake #endif - - thunk __down_failed,__down - thunk_retrax __down_failed_interruptible,__down_interruptible - thunk_retrax __down_failed_trylock,__down_trylock - thunk __up_wakeup,__up + +#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK + thunk __compat_down_failed,__compat_down + thunk_retrax __compat_down_failed_interruptible,__compat_down_interruptible + thunk_retrax __compat_down_failed_trylock,__compat_down_trylock + thunk __compat_up_wakeup,__compat_up +#endif #ifdef CONFIG_TRACE_IRQFLAGS thunk trace_hardirqs_on_thunk,trace_hardirqs_on Index: linux/arch/x86_64/mm/fault.c =================================================================== --- linux.orig/arch/x86_64/mm/fault.c +++ linux/arch/x86_64/mm/fault.c @@ -73,6 +73,7 @@ void bust_spinlocks(int yes) { int loglevel_save = console_loglevel; if (yes) { + stop_trace(); oops_in_progress = 1; } else { #ifdef CONFIG_VT Index: linux/arch/x86_64/mm/init.c =================================================================== --- linux.orig/arch/x86_64/mm/init.c +++ linux/arch/x86_64/mm/init.c @@ -51,7 +51,7 @@ EXPORT_SYMBOL(dma_ops); static unsigned long dma_reserve __initdata; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); /* * NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the Index: linux/block/ll_rw_blk.c =================================================================== --- linux.orig/block/ll_rw_blk.c +++ linux/block/ll_rw_blk.c @@ -1547,7 +1547,7 @@ static int ll_merge_requests_fn(request_ */ void blk_plug_device(request_queue_t *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); /* * don't plug a stopped queue, it must be paired with blk_start_queue() @@ -1570,7 +1570,7 @@ EXPORT_SYMBOL(blk_plug_device); */ int blk_remove_plug(request_queue_t *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) return 0; @@ -1662,7 +1662,7 @@ static void blk_unplug_timeout(unsigned **/ void blk_start_queue(request_queue_t *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); clear_bit(QUEUE_FLAG_STOPPED, &q->queue_flags); @@ -3374,8 +3374,6 @@ static void blk_done_softirq(struct soft } } -#ifdef CONFIG_HOTPLUG_CPU - static int blk_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu) { @@ -3401,8 +3399,6 @@ static struct notifier_block __devinitda .notifier_call = blk_cpu_notify, }; -#endif /* CONFIG_HOTPLUG_CPU */ - /** * blk_complete_request - end I/O on a request * @req: the request being processed Index: linux/drivers/Kconfig =================================================================== --- linux.orig/drivers/Kconfig +++ linux/drivers/Kconfig @@ -78,4 +78,6 @@ source "drivers/rtc/Kconfig" source "drivers/dma/Kconfig" +source "drivers/kvm/Kconfig" + endmenu Index: linux/drivers/Makefile =================================================================== --- linux.orig/drivers/Makefile +++ linux/drivers/Makefile @@ -77,3 +77,4 @@ obj-$(CONFIG_CRYPTO) += crypto/ obj-$(CONFIG_SUPERH) += sh/ obj-$(CONFIG_GENERIC_TIME) += clocksource/ obj-$(CONFIG_DMA_ENGINE) += dma/ +obj-$(CONFIG_KVM) += kvm/ Index: linux/drivers/acpi/ec.c =================================================================== --- linux.orig/drivers/acpi/ec.c +++ linux/drivers/acpi/ec.c @@ -459,6 +459,7 @@ static u32 acpi_ec_gpe_handler(void *dat if (acpi_ec_mode == EC_INTR) { if (acpi_ec_check_status(value, ec->expect_event)) { ec->expect_event = 0; + acpi_enable_gpe(NULL, ec->gpe_bit, ACPI_ISR); wake_up(&ec->wait); } } @@ -468,7 +469,8 @@ static u32 acpi_ec_gpe_handler(void *dat return status == AE_OK ? ACPI_INTERRUPT_HANDLED : ACPI_INTERRUPT_NOT_HANDLED; } - acpi_enable_gpe(NULL, ec->gpe_bit, ACPI_ISR); + if (acpi_ec_mode != EC_INTR) + acpi_enable_gpe(NULL, ec->gpe_bit, ACPI_ISR); return status == AE_OK ? ACPI_INTERRUPT_HANDLED : ACPI_INTERRUPT_NOT_HANDLED; } Index: linux/drivers/acpi/events/evmisc.c =================================================================== --- linux.orig/drivers/acpi/events/evmisc.c +++ linux/drivers/acpi/events/evmisc.c @@ -331,7 +331,6 @@ static void ACPI_SYSTEM_XFACE acpi_ev_gl static u32 acpi_ev_global_lock_handler(void *context) { u8 acquired = FALSE; - acpi_status status; /* * Attempt to get the lock Index: linux/drivers/acpi/executer/exmutex.c =================================================================== --- linux.orig/drivers/acpi/executer/exmutex.c +++ linux/drivers/acpi/executer/exmutex.c @@ -267,9 +267,9 @@ acpi_ex_release_mutex(union acpi_operand && (obj_desc->mutex.os_mutex != ACPI_GLOBAL_LOCK)) { ACPI_ERROR((AE_INFO, "Thread %X cannot release Mutex [%4.4s] acquired by thread %X", - (u32) walk_state->thread->thread_id, + (u32)(long) walk_state->thread->thread_id, acpi_ut_get_node_name(obj_desc->mutex.node), - (u32) obj_desc->mutex.owner_thread->thread_id)); + (u32)(long) obj_desc->mutex.owner_thread->thread_id)); return_ACPI_STATUS(AE_AML_NOT_OWNER); } Index: linux/drivers/acpi/hardware/hwregs.c =================================================================== --- linux.orig/drivers/acpi/hardware/hwregs.c +++ linux/drivers/acpi/hardware/hwregs.c @@ -75,7 +75,7 @@ acpi_status acpi_hw_clear_acpi_status(u3 ACPI_BITMASK_ALL_FIXED_STATUS, (u16) acpi_gbl_FADT->xpm1a_evt_blk.address)); - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); status = acpi_hw_register_write(ACPI_MTX_DO_NOT_LOCK, ACPI_REGISTER_PM1_STATUS, @@ -100,7 +100,7 @@ acpi_status acpi_hw_clear_acpi_status(u3 status = acpi_ev_walk_gpe_list(acpi_hw_clear_gpe_block); unlock_and_exit: - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); return_ACPI_STATUS(status); } @@ -339,7 +339,7 @@ acpi_status acpi_set_register(u32 regist return_ACPI_STATUS(AE_BAD_PARAMETER); } - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); /* Always do a register read first so we can insert the new bits */ @@ -447,7 +447,7 @@ acpi_status acpi_set_register(u32 regist unlock_and_exit: - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); /* Normalize the value that was read */ @@ -487,7 +487,7 @@ acpi_hw_register_read(u8 use_lock, u32 r ACPI_FUNCTION_TRACE(hw_register_read); if (ACPI_MTX_LOCK == use_lock) { - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); } switch (register_id) { @@ -565,7 +565,7 @@ acpi_hw_register_read(u8 use_lock, u32 r unlock_and_exit: if (ACPI_MTX_LOCK == use_lock) { - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); } if (ACPI_SUCCESS(status)) { @@ -611,7 +611,7 @@ acpi_status acpi_hw_register_write(u8 us ACPI_FUNCTION_TRACE(hw_register_write); if (ACPI_MTX_LOCK == use_lock) { - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); } switch (register_id) { @@ -734,7 +734,7 @@ acpi_status acpi_hw_register_write(u8 us unlock_and_exit: if (ACPI_MTX_LOCK == use_lock) { - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); } return_ACPI_STATUS(status); Index: linux/drivers/acpi/namespace/nsinit.c =================================================================== --- linux.orig/drivers/acpi/namespace/nsinit.c +++ linux/drivers/acpi/namespace/nsinit.c @@ -45,6 +45,7 @@ #include #include #include +#include #define _COMPONENT ACPI_NAMESPACE ACPI_MODULE_NAME("nsinit") @@ -537,7 +538,15 @@ acpi_ns_init_one_device(acpi_handle obj_ info->parameter_type = ACPI_PARAM_ARGS; info->flags = ACPI_IGNORE_RETURN_VALUE; + /* + * Some hardware relies on this being executed as atomically + * as possible (without an NMI being received in the middle of + * this) - so disable NMIs and initialize the device: + */ + acpi_nmi_disable(); status = acpi_ns_evaluate(info); + acpi_nmi_enable(); + if (ACPI_SUCCESS(status)) { walk_info->num_INI++; Index: linux/drivers/acpi/osl.c =================================================================== --- linux.orig/drivers/acpi/osl.c +++ linux/drivers/acpi/osl.c @@ -676,13 +676,13 @@ void acpi_os_delete_lock(acpi_spinlock h acpi_status acpi_os_create_semaphore(u32 max_units, u32 initial_units, acpi_handle * handle) { - struct semaphore *sem = NULL; + struct compat_semaphore *sem = NULL; - sem = acpi_os_allocate(sizeof(struct semaphore)); + sem = acpi_os_allocate(sizeof(struct compat_semaphore)); if (!sem) return AE_NO_MEMORY; - memset(sem, 0, sizeof(struct semaphore)); + memset(sem, 0, sizeof(struct compat_semaphore)); sema_init(sem, initial_units); @@ -705,7 +705,7 @@ EXPORT_SYMBOL(acpi_os_create_semaphore); acpi_status acpi_os_delete_semaphore(acpi_handle handle) { - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; if (!sem) @@ -733,7 +733,7 @@ EXPORT_SYMBOL(acpi_os_delete_semaphore); acpi_status acpi_os_wait_semaphore(acpi_handle handle, u32 units, u16 timeout) { acpi_status status = AE_OK; - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; int ret = 0; @@ -820,7 +820,7 @@ EXPORT_SYMBOL(acpi_os_wait_semaphore); */ acpi_status acpi_os_signal_semaphore(acpi_handle handle, u32 units) { - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; if (!sem || (units < 1)) Index: linux/drivers/acpi/processor_idle.c =================================================================== --- linux.orig/drivers/acpi/processor_idle.c +++ linux/drivers/acpi/processor_idle.c @@ -40,8 +40,19 @@ #include /* need_resched() */ #include +/* + * Include the apic definitions for x86 to have the APIC timer related defines + * available also for UP (on SMP it gets magically included via linux/smp.h). + * asm/acpi.h is not an option, as it would require more include magic. Also + * creating an empty asm-ia64/apic.h would just trade pest vs. cholera. + */ +#ifdef CONFIG_X86 +#include +#endif + #include #include +#include #include #include @@ -236,6 +247,65 @@ static void acpi_cstate_enter(struct acp } } +#ifdef ARCH_APICTIMER_STOPS_ON_C3 + +/* + * Some BIOS implementations switch to C3 in the published C2 state. This seems + * to be a common problem on AMD boxen. + */ +static void acpi_timer_check_state(int state, struct acpi_processor *pr, + struct acpi_processor_cx *cx) +{ + struct acpi_processor_power *pwr = &pr->power; + + /* + * Check, if one of the previous states already marked the lapic + * unstable + */ + if (pwr->timer_broadcast_on_state < state) + return; + + if(cx->type == ACPI_STATE_C3 || + boot_cpu_data.x86_vendor == X86_VENDOR_AMD) { + pr->power.timer_broadcast_on_state = state; + return; + } +} + +static void acpi_propagate_timer_broadcast(struct acpi_processor *pr) +{ + cpumask_t mask = cpumask_of_cpu(pr->id); + + if (pr->power.timer_broadcast_on_state < INT_MAX) + on_each_cpu(switch_APIC_timer_to_ipi, &mask, 1, 1); + else + on_each_cpu(switch_ipi_to_APIC_timer, &mask, 1, 1); +} + +/* Power(C) State timer broadcast control */ +static void acpi_state_timer_broadcast(struct acpi_processor *pr, + struct acpi_processor_cx *cx, + int broadcast) +{ + int state = cx - pr->power.states; + + if (state >= pr->power.timer_broadcast_on_state) + lapic_timer_idle_broadcast(broadcast); +} + +#else + +static void acpi_timer_check_state(int state, struct acpi_processor *pr, + struct acpi_processor_cx *cstate) { } +static void acpi_propagate_timer_broadcast(struct acpi_processor *pr) { } +static void acpi_state_timer_broadcast(struct acpi_processor *pr, + struct acpi_processor_cx *cx, + int broadcast) +{ +} + +#endif + static void acpi_processor_idle(void) { struct acpi_processor *pr = NULL; @@ -378,20 +448,24 @@ static void acpi_processor_idle(void) /* Get start time (ticks) */ t1 = inl(acpi_fadt.xpm_tmr_blk.address); /* Invoke C2 */ + acpi_state_timer_broadcast(pr, cx, 1); acpi_cstate_enter(cx); /* Get end time (ticks) */ t2 = inl(acpi_fadt.xpm_tmr_blk.address); +#ifndef CONFIG_IA64 #ifdef CONFIG_GENERIC_TIME /* TSC halts in C2, so notify users */ mark_tsc_unstable(); #endif +#endif /* Re-enable interrupts */ local_irq_enable(); current_thread_info()->status |= TS_POLLING; /* Compute time (ticks) that we were actually asleep */ sleep_ticks = ticks_elapsed(t1, t2) - cx->latency_ticks - C2_OVERHEAD; + acpi_state_timer_broadcast(pr, cx, 0); break; case ACPI_STATE_C3: @@ -414,6 +488,7 @@ static void acpi_processor_idle(void) /* Get start time (ticks) */ t1 = inl(acpi_fadt.xpm_tmr_blk.address); /* Invoke C3 */ + acpi_state_timer_broadcast(pr, cx, 1); acpi_cstate_enter(cx); /* Get end time (ticks) */ t2 = inl(acpi_fadt.xpm_tmr_blk.address); @@ -424,16 +499,19 @@ static void acpi_processor_idle(void) ACPI_MTX_DO_NOT_LOCK); } +#ifndef CONFIG_IA64 #ifdef CONFIG_GENERIC_TIME /* TSC halts in C3, so notify users */ mark_tsc_unstable(); #endif +#endif /* Re-enable interrupts */ local_irq_enable(); current_thread_info()->status |= TS_POLLING; /* Compute time (ticks) that we were actually asleep */ sleep_ticks = ticks_elapsed(t1, t2) - cx->latency_ticks - C3_OVERHEAD; + acpi_state_timer_broadcast(pr, cx, 0); break; default: @@ -902,11 +980,7 @@ static int acpi_processor_power_verify(s unsigned int i; unsigned int working = 0; -#ifdef ARCH_APICTIMER_STOPS_ON_C3 - int timer_broadcast = 0; - cpumask_t mask = cpumask_of_cpu(pr->id); - on_each_cpu(switch_ipi_to_APIC_timer, &mask, 1, 1); -#endif + pr->power.timer_broadcast_on_state = INT_MAX; for (i = 1; i < ACPI_PROCESSOR_MAX_POWER; i++) { struct acpi_processor_cx *cx = &pr->power.states[i]; @@ -918,21 +992,14 @@ static int acpi_processor_power_verify(s case ACPI_STATE_C2: acpi_processor_power_verify_c2(cx); -#ifdef ARCH_APICTIMER_STOPS_ON_C3 - /* Some AMD systems fake C3 as C2, but still - have timer troubles */ - if (cx->valid && - boot_cpu_data.x86_vendor == X86_VENDOR_AMD) - timer_broadcast++; -#endif + if (cx->valid) + acpi_timer_check_state(i, pr, cx); break; case ACPI_STATE_C3: acpi_processor_power_verify_c3(pr, cx); -#ifdef ARCH_APICTIMER_STOPS_ON_C3 if (cx->valid) - timer_broadcast++; -#endif + acpi_timer_check_state(i, pr, cx); break; } @@ -940,10 +1007,7 @@ static int acpi_processor_power_verify(s working++; } -#ifdef ARCH_APICTIMER_STOPS_ON_C3 - if (timer_broadcast) - on_each_cpu(switch_APIC_timer_to_ipi, &mask, 1, 1); -#endif + acpi_propagate_timer_broadcast(pr); return (working); } Index: linux/drivers/acpi/tables/tbrsdt.c =================================================================== --- linux.orig/drivers/acpi/tables/tbrsdt.c +++ linux/drivers/acpi/tables/tbrsdt.c @@ -188,7 +188,7 @@ acpi_status acpi_tb_validate_rsdt(struct if (table_ptr->length < sizeof(struct acpi_table_header)) { ACPI_ERROR((AE_INFO, "RSDT/XSDT length (%X) is smaller than minimum (%zX)", - table_ptr->length, + (unsigned int)table_ptr->length, sizeof(struct acpi_table_header))); return (AE_INVALID_TABLE_LENGTH); Index: linux/drivers/acpi/utilities/utmutex.c =================================================================== --- linux.orig/drivers/acpi/utilities/utmutex.c +++ linux/drivers/acpi/utilities/utmutex.c @@ -116,7 +116,7 @@ void acpi_ut_mutex_terminate(void) /* Delete the spinlocks */ acpi_os_delete_lock(acpi_gbl_gpe_lock); - acpi_os_delete_lock(acpi_gbl_hardware_lock); +// acpi_os_delete_lock(acpi_gbl_hardware_lock); return_VOID; } @@ -259,7 +259,7 @@ acpi_status acpi_ut_acquire_mutex(acpi_m } else { ACPI_EXCEPTION((AE_INFO, status, "Thread %X could not acquire Mutex [%X]", - (u32) this_thread_id, mutex_id)); + (u32)(long) this_thread_id, mutex_id)); } return (status); Index: linux/drivers/ata/libata-core.c =================================================================== --- linux.orig/drivers/ata/libata-core.c +++ linux/drivers/ata/libata-core.c @@ -5526,8 +5526,8 @@ int ata_device_add(const struct ata_prob } /* obtain irq, that may be shared between channels */ - rc = request_irq(ent->irq, ent->port_ops->irq_handler, ent->irq_flags, - DRV_NAME, host); + rc = request_irq(ent->irq, ent->port_ops->irq_handler, + ent->irq_flags | SA_SHIRQ, DRV_NAME, host); if (rc) { dev_printk(KERN_ERR, dev, "irq %lu request failed: %d\n", ent->irq, rc); @@ -5540,8 +5540,8 @@ int ata_device_add(const struct ata_prob so trap it now */ BUG_ON(ent->irq == ent->irq2); - rc = request_irq(ent->irq2, ent->port_ops->irq_handler, ent->irq_flags, - DRV_NAME, host); + rc = request_irq(ent->irq2, ent->port_ops->irq_handler, + ent->irq_flags | SA_SHIRQ, DRV_NAME, host); if (rc) { dev_printk(KERN_ERR, dev, "irq %lu request failed: %d\n", ent->irq2, rc); Index: linux/drivers/ata/libata-scsi.c =================================================================== --- linux.orig/drivers/ata/libata-scsi.c +++ linux/drivers/ata/libata-scsi.c @@ -3128,7 +3128,8 @@ void ata_scsi_hotplug(void *data) for (i = 0; i < ATA_MAX_DEVICES; i++) { struct ata_device *dev = &ap->device[i]; if (ata_dev_enabled(dev) && !dev->sdev) { - queue_delayed_work(ata_aux_wq, &ap->hotplug_task, HZ); + queue_delayed_work(ata_aux_wq, &ap->hotplug_task, + round_jiffies_relative(HZ)); break; } } Index: linux/drivers/block/floppy.c =================================================================== --- linux.orig/drivers/block/floppy.c +++ linux/drivers/block/floppy.c @@ -4157,6 +4157,28 @@ static void floppy_device_release(struct complete(&device_release); } +static int floppy_suspend(struct platform_device *dev, pm_message_t state) +{ + floppy_release_irq_and_dma(); + + return 0; +} + +static int floppy_resume(struct platform_device *dev) +{ + floppy_grab_irq_and_dma(); + + return 0; +} + +static struct platform_driver floppy_driver = { + .suspend = floppy_suspend, + .resume = floppy_resume, + .driver = { + .name = "floppy", + }, +}; + static struct platform_device floppy_device[N_DRIVE]; static struct kobject *floppy_find(dev_t dev, int *part, void *data) @@ -4205,10 +4227,14 @@ static int __init floppy_init(void) if (err) goto out_put_disk; + err = platform_driver_register(&floppy_driver); + if (err) + goto out_unreg_blkdev; + floppy_queue = blk_init_queue(do_fd_request, &floppy_lock); if (!floppy_queue) { err = -ENOMEM; - goto out_unreg_blkdev; + goto out_unreg_driver; } blk_queue_max_sectors(floppy_queue, 64); @@ -4352,6 +4378,8 @@ out_flush_work: out_unreg_region: blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256); blk_cleanup_queue(floppy_queue); +out_unreg_driver: + platform_driver_unregister(&floppy_driver); out_unreg_blkdev: unregister_blkdev(FLOPPY_MAJOR, "fd"); out_put_disk: @@ -4543,6 +4571,7 @@ void cleanup_module(void) init_completion(&device_release); blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256); unregister_blkdev(FLOPPY_MAJOR, "fd"); + platform_driver_unregister(&floppy_driver); for (drive = 0; drive < N_DRIVE; drive++) { del_timer_sync(&motor_off_timer[drive]); Index: linux/drivers/block/paride/pseudo.h =================================================================== --- linux.orig/drivers/block/paride/pseudo.h +++ linux/drivers/block/paride/pseudo.h @@ -43,7 +43,7 @@ static unsigned long ps_timeout; static int ps_tq_active = 0; static int ps_nice = 0; -static DEFINE_SPINLOCK(ps_spinlock __attribute__((unused))); +static __attribute__((unused)) DEFINE_SPINLOCK(ps_spinlock); static DECLARE_WORK(ps_tq, ps_tq_int, NULL); Index: linux/drivers/char/Kconfig =================================================================== --- linux.orig/drivers/char/Kconfig +++ linux/drivers/char/Kconfig @@ -733,6 +733,46 @@ config RTC To compile this driver as a module, choose M here: the module will be called rtc. +config RTC_HISTOGRAM + bool "Real Time Clock Histogram Support" + default n + depends on RTC + ---help--- + If you say Y here then the kernel will track the delivery and + wakeup latency of /dev/rtc using tasks and will report a + histogram to the kernel log when the application closes /dev/rtc. + +config BLOCKER + tristate "Priority Inheritance Debugging (Blocker) Device Support" + depends on X86 + default y + ---help--- + If you say Y here then a device will be created that the userspace + pi_test suite uses to test and measure kernel locking primitives. + +config LPPTEST + tristate "Parallel Port Based Latency Measurement Device" + depends on !PARPORT && X86 + default y + ---help--- + If you say Y here then a device will be created that the userspace + testlpp utility uses to measure IRQ latencies of a target system + from an independent measurement system. + + NOTE: this code assumes x86 PCs and that the parallel port is + bidirectional and is on IRQ 7. + + to use the device, both the target and the source system needs to + run a kernel with CONFIG_LPPTEST enabled. To measure latencies, + use the scripts/testlpp utility in your kernel source directory, + and run it (as root) on the source system - it will start printing + out the latencies it took to get a response from the target system: + + Latency of response: 12.2 usecs (121265 cycles) + + then generate various workloads on the target system to see how + (worst-case-) latencies are impacted. + config SGI_DS1286 tristate "SGI DS1286 RTC support" depends on SGI_IP22 Index: linux/drivers/char/Makefile =================================================================== --- linux.orig/drivers/char/Makefile +++ linux/drivers/char/Makefile @@ -91,6 +91,9 @@ obj-$(CONFIG_GPIO_VR41XX) += vr41xx_giu. obj-$(CONFIG_TANBAC_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_BLOCKER) += blocker.o +obj-$(CONFIG_LPPTEST) += lpptest.o + obj-$(CONFIG_WATCHDOG) += watchdog/ obj-$(CONFIG_MWAVE) += mwave/ obj-$(CONFIG_AGP) += agp/ Index: linux/drivers/char/blocker.c =================================================================== --- /dev/null +++ linux/drivers/char/blocker.c @@ -0,0 +1,107 @@ +/* + * priority inheritance testing device + */ + +#include +#include + +#define BLOCKER_MINOR 221 + +#define BLOCK_IOCTL 4245 +#define BLOCK_SET_DEPTH 4246 + +#define BLOCKER_MAX_LOCK_DEPTH 10 + +void loop(int loops) +{ + int i; + + for (i = 0; i < loops; i++) + get_cycles(); +} + +static spinlock_t blocker_lock[BLOCKER_MAX_LOCK_DEPTH]; + +static unsigned int lock_depth = 1; + +void do_the_lock_and_loop(unsigned int args) +{ + int i, max; + + if (rt_task(current)) + max = lock_depth; + else if (lock_depth > 1) + max = (current->pid % lock_depth) + 1; + else + max = 1; + + /* Always lock from the top down */ + for (i = max-1; i >= 0; i--) + spin_lock(&blocker_lock[i]); + loop(args); + for (i = 0; i < max; i++) + spin_unlock(&blocker_lock[i]); +} + +static int blocker_open(struct inode *in, struct file *file) +{ + printk(KERN_INFO "blocker_open called\n"); + + return 0; +} + +static long blocker_ioctl(struct file *file, + unsigned int cmd, unsigned long args) +{ + switch(cmd) { + case BLOCK_IOCTL: + do_the_lock_and_loop(args); + return 0; + case BLOCK_SET_DEPTH: + if (args >= BLOCKER_MAX_LOCK_DEPTH) + return -EINVAL; + lock_depth = args; + return 0; + default: + return -EINVAL; + } +} + +static struct file_operations blocker_fops = { + .owner = THIS_MODULE, + .llseek = no_llseek, + .unlocked_ioctl = blocker_ioctl, + .open = blocker_open, +}; + +static struct miscdevice blocker_dev = +{ + BLOCKER_MINOR, + "blocker", + &blocker_fops +}; + +static int __init blocker_init(void) +{ + int i; + + if (misc_register(&blocker_dev)) + return -ENODEV; + + for (i = 0; i < BLOCKER_MAX_LOCK_DEPTH; i++) + spin_lock_init(blocker_lock + i); + + return 0; +} + +void __exit blocker_exit(void) +{ + printk(KERN_INFO "blocker device uninstalled\n"); + misc_deregister(&blocker_dev); +} + +module_init(blocker_init); +module_exit(blocker_exit); + +MODULE_LICENSE("GPL"); + Index: linux/drivers/char/hangcheck-timer.c =================================================================== --- linux.orig/drivers/char/hangcheck-timer.c +++ linux/drivers/char/hangcheck-timer.c @@ -117,7 +117,7 @@ __setup("hcheck_reboot", hangcheck_parse __setup("hcheck_dump_tasks", hangcheck_parse_dump_tasks); #endif /* not MODULE */ -#if defined(CONFIG_X86_64) || defined(CONFIG_S390) +#if defined(CONFIG_S390) # define HAVE_MONOTONIC # define TIMER_FREQ 1000000000ULL #elif defined(CONFIG_IA64) Index: linux/drivers/char/hpet.c =================================================================== --- linux.orig/drivers/char/hpet.c +++ linux/drivers/char/hpet.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -50,8 +51,34 @@ #define HPET_RANGE_SIZE 1024 /* from HPET spec */ +#if BITS_PER_LONG == 64 +#define write_counter(V, MC) writeq(V, MC) +#define read_counter(MC) readq(MC) +#else +#define write_counter(V, MC) writel(V, MC) +#define read_counter(MC) readl(MC) +#endif + static u32 hpet_nhpet, hpet_max_freq = HPET_USER_FREQ; +static void __iomem *hpet_mc_ptr; + +static cycle_t read_hpet(void) +{ + return (cycle_t)read_counter((void __iomem *)hpet_mc_ptr); +} + +static struct clocksource clocksource_hpet = { + .name = "hpet", + .rating = 300, + .read = read_hpet, + .mask = 0xffffffffffffffffLL, + .mult = 0, /*to be caluclated*/ + .shift = 10, + .is_continuous = 1, +}; +static struct clocksource *hpet_clocksource_p; + /* A lock for concurrent access by app and isr hpet activity. */ static DEFINE_SPINLOCK(hpet_lock); /* A lock for concurrent intermodule access to hpet and isr hpet activity. */ @@ -78,7 +105,7 @@ struct hpets { struct hpets *hp_next; struct hpet __iomem *hp_hpet; unsigned long hp_hpet_phys; - struct time_interpolator *hp_interpolator; + struct clocksource *hp_clocksource; unsigned long long hp_tick_freq; unsigned long hp_delta; unsigned int hp_ntimer; @@ -93,13 +120,6 @@ static struct hpets *hpets; #define HPET_PERIODIC 0x0004 #define HPET_SHARED_IRQ 0x0008 -#if BITS_PER_LONG == 64 -#define write_counter(V, MC) writeq(V, MC) -#define read_counter(MC) readq(MC) -#else -#define write_counter(V, MC) writel(V, MC) -#define read_counter(MC) readl(MC) -#endif #ifndef readq static inline unsigned long long readq(void __iomem *addr) @@ -736,27 +756,6 @@ static ctl_table dev_root[] = { static struct ctl_table_header *sysctl_header; -static void hpet_register_interpolator(struct hpets *hpetp) -{ -#ifdef CONFIG_TIME_INTERPOLATION - struct time_interpolator *ti; - - ti = kzalloc(sizeof(*ti), GFP_KERNEL); - if (!ti) - return; - - ti->source = TIME_SOURCE_MMIO64; - ti->shift = 10; - ti->addr = &hpetp->hp_hpet->hpet_mc; - ti->frequency = hpetp->hp_tick_freq; - ti->drift = HPET_DRIFT; - ti->mask = -1; - - hpetp->hp_interpolator = ti; - register_time_interpolator(ti); -#endif -} - /* * Adjustment for when arming the timer with * initial conditions. That is, main counter @@ -908,7 +907,14 @@ int hpet_alloc(struct hpet_data *hdp) } hpetp->hp_delta = hpet_calibrate(hpetp); - hpet_register_interpolator(hpetp); + + if (!hpet_clocksource_p) { + hpet_mc_ptr = &hpetp->hp_hpet->hpet_mc; + clocksource_hpet.mult = clocksource_hz2mult(hpetp->hp_tick_freq, + clocksource_hpet.shift); + clocksource_register(&clocksource_hpet); + hpet_clocksource_p = hpetp->hp_clocksource = &clocksource_hpet; + } return 0; } @@ -994,7 +1000,7 @@ static int hpet_acpi_add(struct acpi_dev static int hpet_acpi_remove(struct acpi_device *device, int type) { - /* XXX need to unregister interpolator, dealloc mem, etc */ + /* XXX need to unregister clocksource, dealloc mem, etc */ return -EINVAL; } Index: linux/drivers/char/lpptest.c =================================================================== --- /dev/null +++ linux/drivers/char/lpptest.c @@ -0,0 +1,178 @@ +/* + * /dev/lpptest device: test IRQ handling latencies over parallel port + * + * Copyright (C) 2005 Thomas Gleixner, Ingo Molnar + * + * licensed under the GPL + * + * You need to have CONFIG_PARPORT disabled for this device, it is a + * completely self-contained device that assumes sole ownership of the + * parallel port. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * API wrappers so that the code can be shared with the -rt tree: + */ +#ifndef local_irq_disable +# define local_irq_disable local_irq_disable +# define local_irq_enable local_irq_enable +#endif + +#ifndef IRQ_NODELAY +# define IRQ_NODELAY 0 +# define IRQF_NODELAY 0 +#endif + +/* + * Driver: + */ +#define LPPTEST_CHAR_MAJOR 245 +#define LPPTEST_DEVICE_NAME "lpptest" + +#define LPPTEST_IRQ 7 + +#define LPPTEST_TEST _IOR (LPPTEST_CHAR_MAJOR, 1, unsigned long long) +#define LPPTEST_DISABLE _IOR (LPPTEST_CHAR_MAJOR, 2, unsigned long long) +#define LPPTEST_ENABLE _IOR (LPPTEST_CHAR_MAJOR, 3, unsigned long long) + +static char dev_id[] = "lpptest"; + +#define INIT_PORT() outb(0x04, 0x37a) +#define ENABLE_IRQ() outb(0x10, 0x37a) +#define DISABLE_IRQ() outb(0, 0x37a) + +static unsigned char out = 0x5a; + +/** + * Interrupt handler. Flip a bit in the reply. + */ +static int lpptest_irq (int irq, void *dev_id) +{ + out ^= 0xff; + outb(out, 0x378); + + return IRQ_HANDLED; +} + +static cycles_t test_response(void) +{ + cycles_t now, end; + unsigned char in; + int timeout = 0; + + local_irq_disable(); + in = inb(0x379); + inb(0x378); + outb(0x08, 0x378); + now = get_cycles(); + while(1) { + if (inb(0x379) != in) + break; + if (timeout++ > 1000000) { + outb(0x00, 0x378); + local_irq_enable(); + + return 0; + } + } + end = get_cycles(); + outb(0x00, 0x378); + local_irq_enable(); + + return end - now; +} + +static int lpptest_open(struct inode *inode, struct file *file) +{ + return 0; +} + +static int lpptest_close(struct inode *inode, struct file *file) +{ + return 0; +} + +int lpptest_ioctl(struct inode *inode, struct file *file, unsigned int ioctl_num, unsigned long ioctl_param) +{ + int retval = 0; + + switch (ioctl_num) { + + case LPPTEST_DISABLE: + DISABLE_IRQ(); + break; + + case LPPTEST_ENABLE: + ENABLE_IRQ(); + break; + + case LPPTEST_TEST: { + + cycles_t diff = test_response(); + if (copy_to_user((void *)ioctl_param, (void*) &diff, sizeof(diff))) + goto errcpy; + break; + } + default: retval = -EINVAL; + } + + return retval; + + errcpy: + return -EFAULT; +} + +static struct file_operations lpptest_dev_fops = { + .ioctl = lpptest_ioctl, + .open = lpptest_open, + .release = lpptest_close, +}; + +static int __init lpptest_init (void) +{ + if (register_chrdev(LPPTEST_CHAR_MAJOR, LPPTEST_DEVICE_NAME, &lpptest_dev_fops)) + { + printk(KERN_NOTICE "Can't allocate major number %d for lpptest.\n", + LPPTEST_CHAR_MAJOR); + return -EAGAIN; + } + + if (request_irq (LPPTEST_IRQ, lpptest_irq, 0, "lpptest", dev_id)) { + printk (KERN_WARNING "lpptest: irq %d in use. Unload parport module!\n", LPPTEST_IRQ); + unregister_chrdev(LPPTEST_CHAR_MAJOR, LPPTEST_DEVICE_NAME); + return -EAGAIN; + } + irq_desc[LPPTEST_IRQ].status |= IRQ_NODELAY; + irq_desc[LPPTEST_IRQ].action->flags |= IRQF_NODELAY | IRQF_DISABLED; + + INIT_PORT(); + ENABLE_IRQ(); + + return 0; +} +module_init (lpptest_init); + +static void __exit lpptest_exit (void) +{ + DISABLE_IRQ(); + + free_irq(LPPTEST_IRQ, dev_id); + unregister_chrdev(LPPTEST_CHAR_MAJOR, LPPTEST_DEVICE_NAME); +} +module_exit (lpptest_exit); + +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("lpp test module"); + Index: linux/drivers/char/random.c =================================================================== --- linux.orig/drivers/char/random.c +++ linux/drivers/char/random.c @@ -580,8 +580,11 @@ static void add_timer_randomness(struct preempt_disable(); /* if over the trickle threshold, use only 1 in 4096 samples */ if (input_pool.entropy_count > trickle_thresh && - (__get_cpu_var(trickle_count)++ & 0xfff)) - goto out; + (__get_cpu_var(trickle_count)++ & 0xfff)) { + preempt_enable(); + return; + } + preempt_enable(); sample.jiffies = jiffies; sample.cycles = get_cycles(); @@ -626,9 +629,6 @@ static void add_timer_randomness(struct if(input_pool.entropy_count >= random_read_wakeup_thresh) wake_up_interruptible(&random_read_wait); - -out: - preempt_enable(); } void add_input_randomness(unsigned int type, unsigned int code, Index: linux/drivers/char/rtc.c =================================================================== --- linux.orig/drivers/char/rtc.c +++ linux/drivers/char/rtc.c @@ -82,10 +82,36 @@ #include #include +#ifdef CONFIG_MIPS +# include +#endif + #if defined(__i386__) #include #endif +#ifdef CONFIG_RTC_HISTOGRAM + +static cycles_t last_interrupt_time; + +#include + +#define CPU_MHZ (cpu_khz / 1000) + +#define HISTSIZE 10000 +static int histogram[HISTSIZE]; + +static int rtc_state; + +enum rtc_states { + S_STARTUP, /* First round - let the application start */ + S_IDLE, /* Waiting for an interrupt */ + S_WAITING_FOR_READ, /* Signal delivered. waiting for rtc_read() */ + S_READ_MISSED, /* Signal delivered, read() deadline missed */ +}; + +#endif + #ifdef __sparc__ #include #include @@ -218,7 +244,146 @@ static inline unsigned char rtc_is_updat return uip; } +#ifndef RTC_IRQ +# undef CONFIG_RTC_HISTOGRAM +#endif + +static inline void rtc_open_event(void) +{ +#ifdef CONFIG_RTC_HISTOGRAM + int i; + + last_interrupt_time = 0; + rtc_state = S_STARTUP; + rtc_irq_data = 0; + + for (i = 0; i < HISTSIZE; i++) + histogram[i] = 0; +#endif +} + +static inline void rtc_wake_event(void) +{ +#ifndef CONFIG_RTC_HISTOGRAM + kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); +#else + if (!(rtc_status & RTC_IS_OPEN)) + return; + + switch (rtc_state) { + /* Startup */ + case S_STARTUP: + kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); + break; + /* Waiting for an interrupt */ + case S_IDLE: + kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); + last_interrupt_time = get_cycles(); + rtc_state = S_WAITING_FOR_READ; + break; + + /* Signal has been delivered. waiting for rtc_read() */ + case S_WAITING_FOR_READ: + /* + * Well foo. The usermode application didn't + * schedule and read in time. + */ + last_interrupt_time = get_cycles(); + rtc_state = S_READ_MISSED; + printk("Read missed before next interrupt\n"); + break; + /* Signal has been delivered, read() deadline was missed */ + case S_READ_MISSED: + /* + * Not much we can do here. We're waiting for the usermode + * application to read the rtc + */ + last_interrupt_time = get_cycles(); + break; + } +#endif +} + +static inline void rtc_read_event(void) +{ +#ifdef CONFIG_RTC_HISTOGRAM + cycles_t now = get_cycles(); + + switch (rtc_state) { + /* Startup */ + case S_STARTUP: + rtc_state = S_IDLE; + break; + + /* Waiting for an interrupt */ + case S_IDLE: + printk("bug in rtc_read(): called in state S_IDLE!\n"); + break; + case S_WAITING_FOR_READ: /* + * Signal has been delivered. + * waiting for rtc_read() + */ + /* + * Well done + */ + case S_READ_MISSED: /* + * Signal has been delivered, read() + * deadline was missed + */ + /* + * So, you finally got here. + */ + if (!last_interrupt_time) + printk("bug in rtc_read(): last_interrupt_time = 0\n"); + rtc_state = S_IDLE; + { + cycles_t latency = now - last_interrupt_time; + unsigned long delta; /* Microseconds */ + + delta = latency; + delta /= CPU_MHZ; + + if (delta > 1000 * 1000) { + printk("rtc: eek\n"); + } else { + unsigned long slot = delta; + if (slot >= HISTSIZE) + slot = HISTSIZE - 1; + histogram[slot]++; + if (delta > 2000) + printk("wow! That was a " + "%ld millisec bump\n", + delta / 1000); + } + } + rtc_state = S_IDLE; + break; + } +#endif +} + +static inline void rtc_close_event(void) +{ +#ifdef CONFIG_RTC_HISTOGRAM + int i = 0; + unsigned long total = 0; + + for (i = 0; i < HISTSIZE; i++) + total += histogram[i]; + if (!total) + return; + + printk("\nrtc latency histogram of {%s/%d, %lu samples}:\n", + current->comm, current->pid, total); + for (i = 0; i < HISTSIZE; i++) { + if (histogram[i]) + printk("%d %d\n", i, histogram[i]); + } +#endif +} + #ifdef RTC_IRQ + /* * A very tiny interrupt handler. It runs with IRQF_DISABLED set, * but there is possibility of conflicting with the set_rtc_mmss() @@ -262,7 +427,7 @@ irqreturn_t rtc_interrupt(int irq, void if (rtc_callback) rtc_callback->func(rtc_callback->private_data); spin_unlock(&rtc_task_lock); - wake_up_interruptible(&rtc_wait); + wake_up_interruptible(&rtc_wait); kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); @@ -376,6 +541,8 @@ static ssize_t rtc_read(struct file *fil schedule(); } while (1); + rtc_read_event(); + if (count == sizeof(unsigned int)) retval = put_user(data, (unsigned int __user *)buf) ?: sizeof(int); else @@ -608,6 +775,11 @@ static int rtc_do_ioctl(unsigned int cmd save_freq_select = CMOS_READ(RTC_FREQ_SELECT); CMOS_WRITE((save_freq_select|RTC_DIV_RESET2), RTC_FREQ_SELECT); + /* + * Make CMOS date writes nonpreemptible even on PREEMPT_RT. + * There's a limit to everything! =B-) + */ + preempt_disable(); #ifdef CONFIG_MACH_DECSTATION CMOS_WRITE(real_yrs, RTC_DEC_YEAR); #endif @@ -617,6 +789,7 @@ static int rtc_do_ioctl(unsigned int cmd CMOS_WRITE(hrs, RTC_HOURS); CMOS_WRITE(min, RTC_MINUTES); CMOS_WRITE(sec, RTC_SECONDS); + preempt_enable(); CMOS_WRITE(save_control, RTC_CONTROL); CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT); @@ -715,6 +888,7 @@ static int rtc_open(struct inode *inode, if(rtc_status & RTC_IS_OPEN) goto out_busy; + rtc_open_event(); rtc_status |= RTC_IS_OPEN; rtc_irq_data = 0; @@ -770,6 +944,7 @@ no_irq: rtc_irq_data = 0; rtc_status &= ~RTC_IS_OPEN; spin_unlock_irq (&rtc_lock); + rtc_close_event(); return 0; } @@ -1153,6 +1328,7 @@ static void rtc_dropped_irq(unsigned lon printk(KERN_WARNING "rtc: lost some interrupts at %ldHz.\n", freq); /* Now we have new data */ + rtc_wake_event(); wake_up_interruptible(&rtc_wait); kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); Index: linux/drivers/char/sysrq.c =================================================================== --- linux.orig/drivers/char/sysrq.c +++ linux/drivers/char/sysrq.c @@ -171,6 +171,22 @@ static struct sysrq_key_op sysrq_showreg .enable_mask = SYSRQ_ENABLE_DUMP, }; +#if defined(__i386__) + +static void sysrq_handle_showallregs(int key, struct tty_struct *tty) +{ + nmi_show_all_regs(); +} + +static struct sysrq_key_op sysrq_showallregs_op = { + .handler = sysrq_handle_showallregs, + .help_msg = "showalLcpupc", + .action_msg = "Show Regs On All CPUs", +}; +#else +#define sysrq_showallregs_op (*(struct sysrq_key_op *)0) +#endif + static void sysrq_handle_showstate(int key, struct tty_struct *tty) { show_state(); @@ -290,7 +306,7 @@ static struct sysrq_key_op *sysrq_key_ta &sysrq_kill_op, /* i */ NULL, /* j */ &sysrq_SAK_op, /* k */ - NULL, /* l */ + &sysrq_showallregs_op, /* l */ &sysrq_showmem_op, /* m */ &sysrq_unrt_op, /* n */ /* This will often be registered as 'Off' at init time */ Index: linux/drivers/char/tty_io.c =================================================================== --- linux.orig/drivers/char/tty_io.c +++ linux/drivers/char/tty_io.c @@ -249,6 +249,7 @@ static int check_tty_count(struct tty_st printk(KERN_WARNING "Warning: dev (%s) tty->count(%d) " "!= #fd's(%d) in %s\n", tty->name, tty->count, count, routine); + dump_stack(); return count; } #endif @@ -1506,9 +1507,6 @@ void disassociate_ctty(int on_exit) /* Must lock changes to tty_old_pgrp */ mutex_lock(&tty_mutex); - current->signal->tty_old_pgrp = 0; - tty->session = 0; - tty->pgrp = -1; /* Now clear signal->tty under the lock */ read_lock(&tasklist_lock); @@ -1516,6 +1514,11 @@ void disassociate_ctty(int on_exit) p->signal->tty = NULL; } while_each_task_pid(current->signal->session, PIDTYPE_SID, p); read_unlock(&tasklist_lock); + + current->signal->tty_old_pgrp = 0; + tty->session = 0; + tty->pgrp = -1; + mutex_unlock(&tty_mutex); unlock_kernel(); } Index: linux/drivers/clocksource/acpi_pm.c =================================================================== --- linux.orig/drivers/clocksource/acpi_pm.c +++ linux/drivers/clocksource/acpi_pm.c @@ -16,15 +16,13 @@ * This file is licensed under the GPL v2. */ +#include #include #include #include #include #include -/* Number of PMTMR ticks expected during calibration run */ -#define PMTMR_TICKS_PER_SEC 3579545 - /* * The I/O port the PMTMR resides at. * The location is detected during setup_arch(), @@ -32,15 +30,13 @@ */ u32 pmtmr_ioport __read_mostly; -#define ACPI_PM_MASK CLOCKSOURCE_MASK(24) /* limit it to 24 bits */ - -static inline u32 read_pmtmr(void) +static notrace inline u32 read_pmtmr(void) { /* mask the output to 24 bits */ return inl(pmtmr_ioport) & ACPI_PM_MASK; } -static cycle_t acpi_pm_read_verified(void) +u32 notrace acpi_pm_read_verified(void) { u32 v1 = 0, v2 = 0, v3 = 0; @@ -57,10 +53,15 @@ static cycle_t acpi_pm_read_verified(voi } while (unlikely((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1) || (v3 > v1 && v3 < v2))); - return (cycle_t)v2; + return v2; +} + +static notrace cycle_t acpi_pm_read_slow(void) +{ + return (cycle_t)acpi_pm_read_verified(); } -static cycle_t acpi_pm_read(void) +static notrace cycle_t acpi_pm_read(void) { return (cycle_t)read_pmtmr(); } @@ -87,7 +88,7 @@ __setup("acpi_pm_good", acpi_pm_good_set static inline void acpi_pm_need_workaround(void) { - clocksource_acpi_pm.read = acpi_pm_read_verified; + clocksource_acpi_pm.read = acpi_pm_read_slow; clocksource_acpi_pm.rating = 110; } Index: linux/drivers/cpufreq/cpufreq.c =================================================================== --- linux.orig/drivers/cpufreq/cpufreq.c +++ linux/drivers/cpufreq/cpufreq.c @@ -1535,7 +1535,6 @@ int cpufreq_update_policy(unsigned int c } EXPORT_SYMBOL(cpufreq_update_policy); -#ifdef CONFIG_HOTPLUG_CPU static int cpufreq_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { @@ -1575,7 +1574,6 @@ static struct notifier_block __cpuinitda { .notifier_call = cpufreq_cpu_callback, }; -#endif /* CONFIG_HOTPLUG_CPU */ /********************************************************************* * REGISTER / UNREGISTER CPUFREQ DRIVER * Index: linux/drivers/ide/ide-floppy.c =================================================================== --- linux.orig/drivers/ide/ide-floppy.c +++ linux/drivers/ide/ide-floppy.c @@ -1665,9 +1665,9 @@ static int idefloppy_get_format_progress atapi_status_t status; unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); status.all = HWIF(drive)->INB(IDE_STATUS_REG); - local_irq_restore(flags); + local_irq_restore_nort(flags); progress_indication = !status.b.dsc ? 0 : 0x10000; } Index: linux/drivers/ide/ide-io.c =================================================================== --- linux.orig/drivers/ide/ide-io.c +++ linux/drivers/ide/ide-io.c @@ -1185,7 +1185,7 @@ static void ide_do_request (ide_hwgroup_ ide_get_lock(ide_intr, hwgroup); /* caller must own ide_lock */ - BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); while (!hwgroup->busy) { hwgroup->busy = 1; @@ -1450,7 +1450,7 @@ void ide_timer_expiry (unsigned long dat #endif /* DISABLE_IRQ_NOSYNC */ /* local CPU only, * as if we were handling an interrupt */ - local_irq_disable(); + local_irq_disable_nort(); if (hwgroup->polling) { startstop = handler(drive); } else if (drive_is_ready(drive)) { Index: linux/drivers/ide/ide-iops.c =================================================================== --- linux.orig/drivers/ide/ide-iops.c +++ linux/drivers/ide/ide-iops.c @@ -244,10 +244,10 @@ static void ata_input_data(ide_drive_t * if (io_32bit) { if (io_32bit & 2) { unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); ata_vlb_sync(drive, IDE_NSECTOR_REG); hwif->INSL(IDE_DATA_REG, buffer, wcount); - local_irq_restore(flags); + local_irq_restore_nort(flags); } else hwif->INSL(IDE_DATA_REG, buffer, wcount); } else { @@ -266,10 +266,10 @@ static void ata_output_data(ide_drive_t if (io_32bit) { if (io_32bit & 2) { unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); ata_vlb_sync(drive, IDE_NSECTOR_REG); hwif->OUTSL(IDE_DATA_REG, buffer, wcount); - local_irq_restore(flags); + local_irq_restore_nort(flags); } else hwif->OUTSL(IDE_DATA_REG, buffer, wcount); } else { @@ -564,12 +564,12 @@ int ide_wait_stat (ide_startstop_t *star if (!(stat & BUSY_STAT)) break; - local_irq_restore(flags); + local_irq_restore_nort(flags); *startstop = ide_error(drive, "status timeout", stat); return 1; } } - local_irq_restore(flags); + local_irq_restore_nort(flags); } /* * Allow status to settle, then read it again. @@ -731,17 +731,15 @@ int ide_driveid_update (ide_drive_t *dri printk("%s: CHECK for good STATUS\n", drive->name); return 0; } - local_irq_save(flags); - SELECT_MASK(drive, 0); id = kmalloc(SECTOR_WORDS*4, GFP_ATOMIC); - if (!id) { - local_irq_restore(flags); + if (!id) return 0; - } + local_irq_save_nort(flags); + SELECT_MASK(drive, 0); ata_input_data(drive, id, SECTOR_WORDS); (void) hwif->INB(IDE_STATUS_REG); /* clear drive IRQ */ - local_irq_enable(); - local_irq_restore(flags); + local_irq_enable_nort(); + local_irq_restore_nort(flags); ide_fix_driveid(id); if (id) { drive->id->dma_ultra = id->dma_ultra; @@ -821,7 +819,7 @@ int ide_config_drive_speed (ide_drive_t if (time_after(jiffies, timeout)) break; } - local_irq_restore(flags); + local_irq_restore_nort(flags); } /* Index: linux/drivers/ide/ide-lib.c =================================================================== --- linux.orig/drivers/ide/ide-lib.c +++ linux/drivers/ide/ide-lib.c @@ -466,15 +466,16 @@ EXPORT_SYMBOL_GPL(ide_set_xfer_rate); static void ide_dump_opcode(ide_drive_t *drive) { + unsigned long flags; struct request *rq; u8 opcode = 0; int found = 0; - spin_lock(&ide_lock); + spin_lock_irqsave(&ide_lock, flags); rq = NULL; if (HWGROUP(drive)) rq = HWGROUP(drive)->rq; - spin_unlock(&ide_lock); + spin_unlock_irqrestore(&ide_lock, flags); if (!rq) return; if (rq->cmd_type == REQ_TYPE_ATA_CMD || @@ -503,10 +504,8 @@ static void ide_dump_opcode(ide_drive_t static u8 ide_dump_ata_status(ide_drive_t *drive, const char *msg, u8 stat) { ide_hwif_t *hwif = HWIF(drive); - unsigned long flags; u8 err = 0; - local_irq_save(flags); printk("%s: %s: status=0x%02x { ", drive->name, msg, stat); if (stat & BUSY_STAT) printk("Busy "); @@ -566,7 +565,7 @@ static u8 ide_dump_ata_status(ide_drive_ printk("\n"); } ide_dump_opcode(drive); - local_irq_restore(flags); + return err; } @@ -581,14 +580,11 @@ static u8 ide_dump_ata_status(ide_drive_ static u8 ide_dump_atapi_status(ide_drive_t *drive, const char *msg, u8 stat) { - unsigned long flags; - atapi_status_t status; atapi_error_t error; status.all = stat; error.all = 0; - local_irq_save(flags); printk("%s: %s: status=0x%02x { ", drive->name, msg, stat); if (status.b.bsy) printk("Busy "); @@ -614,7 +610,7 @@ static u8 ide_dump_atapi_status(ide_driv printk("}\n"); } ide_dump_opcode(drive); - local_irq_restore(flags); + return error.all; } Index: linux/drivers/ide/ide-probe.c =================================================================== --- linux.orig/drivers/ide/ide-probe.c +++ linux/drivers/ide/ide-probe.c @@ -143,7 +143,7 @@ static inline void do_identify (ide_driv hwif->ata_input_data(drive, id, SECTOR_WORDS); drive->id_read = 1; - local_irq_enable(); + local_irq_enable_nort(); ide_fix_driveid(id); #if defined (CONFIG_SCSI_EATA_DMA) || defined (CONFIG_SCSI_EATA_PIO) || defined (CONFIG_SCSI_EATA) @@ -325,14 +325,14 @@ static int actual_try_to_identify (ide_d unsigned long flags; /* local CPU only; some systems need this */ - local_irq_save(flags); + local_irq_save_nort(flags); /* drive returned ID */ do_identify(drive, cmd); /* drive responded with ID */ rc = 0; /* clear drive IRQ */ (void) hwif->INB(IDE_STATUS_REG); - local_irq_restore(flags); + local_irq_restore_nort(flags); } else { /* drive refused ID */ rc = 2; @@ -809,7 +809,7 @@ static void probe_hwif(ide_hwif_t *hwif) } while ((stat & BUSY_STAT) && time_after(timeout, jiffies)); } - local_irq_restore(flags); + local_irq_restore_nort(flags); /* * Use cached IRQ number. It might be (and is...) changed by probe * code above Index: linux/drivers/ide/ide-taskfile.c =================================================================== --- linux.orig/drivers/ide/ide-taskfile.c +++ linux/drivers/ide/ide-taskfile.c @@ -274,7 +274,7 @@ static void ide_pio_sector(ide_drive_t * offset %= PAGE_SIZE; #ifdef CONFIG_HIGHMEM - local_irq_save(flags); + local_irq_save_nort(flags); #endif buf = kmap_atomic(page, KM_BIO_SRC_IRQ) + offset; @@ -294,7 +294,7 @@ static void ide_pio_sector(ide_drive_t * kunmap_atomic(buf, KM_BIO_SRC_IRQ); #ifdef CONFIG_HIGHMEM - local_irq_restore(flags); + local_irq_restore_nort(flags); #endif } @@ -460,7 +460,7 @@ ide_startstop_t pre_task_out_intr (ide_d } if (!drive->unmask) - local_irq_disable(); + local_irq_disable_nort(); ide_set_handler(drive, &task_out_intr, WAIT_WORSTCASE, NULL); ide_pio_datablock(drive, rq, 1); Index: linux/drivers/ide/pci/alim15x3.c =================================================================== --- linux.orig/drivers/ide/pci/alim15x3.c +++ linux/drivers/ide/pci/alim15x3.c @@ -322,7 +322,7 @@ static void ali15x3_tune_drive (ide_driv if (r_clc >= 16) r_clc = 0; } - local_irq_save(flags); + local_irq_save_nort(flags); /* * PIO mode => ATA FIFO on, ATAPI FIFO off @@ -344,7 +344,7 @@ static void ali15x3_tune_drive (ide_driv pci_write_config_byte(dev, port, s_clc); pci_write_config_byte(dev, port+drive->select.b.unit+2, (a_clc << 4) | r_clc); - local_irq_restore(flags); + local_irq_restore_nort(flags); /* * setup active rec @@ -600,7 +600,7 @@ static unsigned int __devinit init_chips } #endif /* defined(DISPLAY_ALI_TIMINGS) && defined(CONFIG_PROC_FS) */ - local_irq_save(flags); + local_irq_save_nort(flags); if (m5229_revision < 0xC2) { /* @@ -613,7 +613,7 @@ static unsigned int __devinit init_chips * clear bit 7 */ pci_write_config_byte(dev, 0x4b, tmpbyte & 0x7F); - local_irq_restore(flags); + local_irq_restore_nort(flags); return 0; } @@ -638,7 +638,7 @@ static unsigned int __devinit init_chips * 0:0.0 so if we didn't find one we know what is cooking. */ if (north && north->vendor != PCI_VENDOR_ID_AL) { - local_irq_restore(flags); + local_irq_restore_nort(flags); return 0; } @@ -661,7 +661,7 @@ static unsigned int __devinit init_chips pci_write_config_byte(isa_dev, 0x79, tmpbyte | 0x02); } } - local_irq_restore(flags); + local_irq_restore_nort(flags); return 0; } @@ -685,7 +685,7 @@ static unsigned int __devinit ata66_ali1 unsigned long flags; u8 tmpbyte; - local_irq_save(flags); + local_irq_save_nort(flags); if (m5229_revision >= 0xC2) { /* @@ -737,7 +737,7 @@ static unsigned int __devinit ata66_ali1 pci_write_config_byte(dev, 0x53, tmpbyte); - local_irq_restore(flags); + local_irq_restore_nort(flags); return(ata66); } Index: linux/drivers/ide/pci/cs5530.c =================================================================== --- linux.orig/drivers/ide/pci/cs5530.c +++ linux/drivers/ide/pci/cs5530.c @@ -241,8 +241,8 @@ static unsigned int __devinit init_chips goto out; } - spin_lock_irqsave(&ide_lock, flags); - /* all CPUs (there should only be one CPU with this chipset) */ + /* Local CPU. ide_lock is acquired in do_ide_setup_pci_device. */ + local_irq_save(flags); /* * Enable BusMaster and MemoryWriteAndInvalidate for the cs5530: @@ -294,7 +294,7 @@ static unsigned int __devinit init_chips pci_write_config_byte(master_0, 0x42, 0x00); pci_write_config_byte(master_0, 0x43, 0xc1); - spin_unlock_irqrestore(&ide_lock, flags); + local_irq_restore(flags); out: pci_dev_put(master_0); Index: linux/drivers/ide/pci/hpt366.c =================================================================== --- linux.orig/drivers/ide/pci/hpt366.c +++ linux/drivers/ide/pci/hpt366.c @@ -1496,7 +1496,7 @@ static void __devinit init_dma_hpt366(id dma_old = hwif->INB(dmabase+2); - local_irq_save(flags); + local_irq_save_nort(flags); dma_new = dma_old; pci_read_config_byte(hwif->pci_dev, primary, &masterdma); @@ -1507,7 +1507,7 @@ static void __devinit init_dma_hpt366(id if (dma_new != dma_old) hwif->OUTB(dma_new, dmabase+2); - local_irq_restore(flags); + local_irq_restore_nort(flags); ide_setup_dma(hwif, dmabase, 8); } Index: linux/drivers/input/gameport/gameport.c =================================================================== --- linux.orig/drivers/input/gameport/gameport.c +++ linux/drivers/input/gameport/gameport.c @@ -21,6 +21,7 @@ #include #include #include +#include #include /* HZ */ #include @@ -101,12 +102,12 @@ static int gameport_measure_speed(struct tx = 1 << 30; for(i = 0; i < 50; i++) { - local_irq_save(flags); + local_irq_save_nort(flags); GET_TIME(t1); for (t = 0; t < 50; t++) gameport_read(gameport); GET_TIME(t2); GET_TIME(t3); - local_irq_restore(flags); + local_irq_restore_nort(flags); udelay(i * 10); if ((t = DELTA(t2,t1) - DELTA(t3,t2)) < tx) tx = t; } @@ -125,11 +126,11 @@ static int gameport_measure_speed(struct tx = 1 << 30; for(i = 0; i < 50; i++) { - local_irq_save(flags); + local_irq_save_nort(flags); rdtscl(t1); for (t = 0; t < 50; t++) gameport_read(gameport); rdtscl(t2); - local_irq_restore(flags); + local_irq_restore_nort(flags); udelay(i * 10); if (t2 - t1 < tx) tx = t2 - t1; } Index: linux/drivers/kvm/Kconfig =================================================================== --- /dev/null +++ linux/drivers/kvm/Kconfig @@ -0,0 +1,18 @@ +# +# KVM configuration +# +config KVM + tristate "Kernel-based Virtual Machine (KVM) support" + depends on X86 && EXPERIMENTAL + ---help--- + Support hosting fully virtualized guest machines using hardware + virtualization extensions. You will need a fairly recent Intel + processor equipped with VT extensions. + + This module provides access to the hardware capabilities through + a character device node named /dev/kvm. + + To compile this as a module, choose M here: the module + will be called kvm. + + If unsure, say N. Index: linux/drivers/kvm/Makefile =================================================================== --- /dev/null +++ linux/drivers/kvm/Makefile @@ -0,0 +1,9 @@ +# +# Makefile for Kernel-based Virtual Machine module +# + +obj-$(CONFIG_KVM) += kvm.o kvm-intel.o kvm-amd.o + +kvm-objs := kvm_main.o mmu.o x86_emulate.o debug.o +kvm-intel-objs := vmx.o +kvm-amd-objs := svm.o Index: linux/drivers/kvm/debug.c =================================================================== --- /dev/null +++ linux/drivers/kvm/debug.c @@ -0,0 +1,1050 @@ +/* + * Kernel-based Virtual Machine driver for Linux + * + * This module enables machines with Intel VT-x extensions to run virtual + * machines without emulation or binary translation. + * + * Debug support + * + * Copyright (C) 2006 Qumranet, Inc. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + * + */ + +#include + +#include "kvm.h" +#include "debug.h" + +#ifdef KVM_DEBUG + +static const char *vmx_msr_name[] = { + "MSR_EFER", "MSR_STAR", "MSR_CSTAR", + "MSR_KERNEL_GS_BASE", "MSR_SYSCALL_MASK", "MSR_LSTAR" +}; + +#define NR_VMX_MSR (sizeof(vmx_msr_name) / sizeof(char*)) + +void show_msrs(struct kvm_vcpu *vcpu) +{ + int i; + + for (i = 0; i < NR_VMX_MSR; ++i) { + vcpu_printf(vcpu, "%s: %s=0x%llx\n", + __FUNCTION__, + vmx_msr_name[i], + vcpu->guest_msrs[i].data); + } +} + +void show_code(struct kvm_vcpu *vcpu) +{ + gva_t rip = vmcs_readl(GUEST_RIP); + u8 code[50]; + char buf[30 + 3 * sizeof code]; + int i; + + if (!is_long_mode()) + rip += vmcs_readl(GUEST_CS_BASE); + + kvm_read_guest(vcpu, rip, sizeof code, code); + for (i = 0; i < sizeof code; ++i) + sprintf(buf + i * 3, " %02x", code[i]); + vcpu_printf(vcpu, "code: %lx%s\n", rip, buf); +} + +struct gate_struct { + u16 offset_low; + u16 segment; + unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1; + u16 offset_middle; + u32 offset_high; + u32 zero1; +} __attribute__((packed)); + +void show_irq(struct kvm_vcpu *vcpu, int irq) +{ + unsigned long idt_base = vmcs_readl(GUEST_IDTR_BASE); + unsigned long idt_limit = vmcs_readl(GUEST_IDTR_LIMIT); + struct gate_struct gate; + + if (!is_long_mode()) + vcpu_printf(vcpu, "%s: not in long mode\n", __FUNCTION__); + + if (!is_long_mode() || idt_limit < irq * sizeof(gate)) { + vcpu_printf(vcpu, "%s: 0x%x read_guest err\n", + __FUNCTION__, + irq); + return; + } + + if (kvm_read_guest(vcpu, idt_base + irq * sizeof(gate), sizeof(gate), &gate) != sizeof(gate)) { + vcpu_printf(vcpu, "%s: 0x%x read_guest err\n", + __FUNCTION__, + irq); + return; + } + vcpu_printf(vcpu, "%s: 0x%x handler 0x%llx\n", + __FUNCTION__, + irq, + ((u64)gate.offset_high << 32) | + ((u64)gate.offset_middle << 16) | + gate.offset_low); +} + +void show_page(struct kvm_vcpu *vcpu, + gva_t addr) +{ + u64 *buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + + if (!buf) + return; + + addr &= PAGE_MASK; + if (kvm_read_guest(vcpu, addr, PAGE_SIZE, buf)) { + int i; + for (i = 0; i < PAGE_SIZE / sizeof(u64) ; i++) { + u8 *ptr = (u8*)&buf[i]; + int j; + vcpu_printf(vcpu, " 0x%16.16lx:", + addr + i * sizeof(u64)); + for (j = 0; j < sizeof(u64) ; j++) + vcpu_printf(vcpu, " 0x%2.2x", ptr[j]); + vcpu_printf(vcpu, "\n"); + } + } + kfree(buf); +} + +void show_u64(struct kvm_vcpu *vcpu, gva_t addr) +{ + u64 buf; + + if (kvm_read_guest(vcpu, addr, sizeof(u64), &buf) == sizeof(u64)) { + u8 *ptr = (u8*)&buf; + int j; + vcpu_printf(vcpu, " 0x%16.16lx:", addr); + for (j = 0; j < sizeof(u64) ; j++) + vcpu_printf(vcpu, " 0x%2.2x", ptr[j]); + vcpu_printf(vcpu, "\n"); + } +} + +#define IA32_DEBUGCTL_RESERVED_BITS 0xfffffffffffffe3cULL + +static int is_canonical(unsigned long addr) +{ + return addr == ((long)addr << 16) >> 16; +} + +int vm_entry_test_guest(struct kvm_vcpu *vcpu) +{ + unsigned long cr0; + unsigned long cr4; + unsigned long cr3; + unsigned long dr7; + u64 ia32_debugctl; + unsigned long sysenter_esp; + unsigned long sysenter_eip; + unsigned long rflags; + + int long_mode; + int virtual8086; + + #define RFLAGS_VM (1 << 17) + #define RFLAGS_RF (1 << 9) + + + #define VIR8086_SEG_BASE_TEST(seg)\ + if (vmcs_readl(GUEST_##seg##_BASE) != \ + (unsigned long)vmcs_read16(GUEST_##seg##_SELECTOR) << 4) {\ + vcpu_printf(vcpu, "%s: "#seg" base 0x%lx in "\ + "virtual8086 is not "#seg" selector 0x%x"\ + " shifted right 4 bits\n",\ + __FUNCTION__,\ + vmcs_readl(GUEST_##seg##_BASE),\ + vmcs_read16(GUEST_##seg##_SELECTOR));\ + return 0;\ + } + + #define VIR8086_SEG_LIMIT_TEST(seg)\ + if (vmcs_readl(GUEST_##seg##_LIMIT) != 0x0ffff) { \ + vcpu_printf(vcpu, "%s: "#seg" limit 0x%lx in "\ + "virtual8086 is not 0xffff\n",\ + __FUNCTION__,\ + vmcs_readl(GUEST_##seg##_LIMIT));\ + return 0;\ + } + + #define VIR8086_SEG_AR_TEST(seg)\ + if (vmcs_read32(GUEST_##seg##_AR_BYTES) != 0x0f3) { \ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x in "\ + "virtual8086 is not 0xf3\n",\ + __FUNCTION__,\ + vmcs_read32(GUEST_##seg##_AR_BYTES));\ + return 0;\ + } + + + cr0 = vmcs_readl(GUEST_CR0); + + if (!(cr0 & CR0_PG_MASK)) { + vcpu_printf(vcpu, "%s: cr0 0x%lx, PG is not set\n", + __FUNCTION__, cr0); + return 0; + } + + if (!(cr0 & CR0_PE_MASK)) { + vcpu_printf(vcpu, "%s: cr0 0x%lx, PE is not set\n", + __FUNCTION__, cr0); + return 0; + } + + if (!(cr0 & CR0_NE_MASK)) { + vcpu_printf(vcpu, "%s: cr0 0x%lx, NE is not set\n", + __FUNCTION__, cr0); + return 0; + } + + if (!(cr0 & CR0_WP_MASK)) { + vcpu_printf(vcpu, "%s: cr0 0x%lx, WP is not set\n", + __FUNCTION__, cr0); + } + + cr4 = vmcs_readl(GUEST_CR4); + + if (!(cr4 & CR4_VMXE_MASK)) { + vcpu_printf(vcpu, "%s: cr4 0x%lx, VMXE is not set\n", + __FUNCTION__, cr4); + return 0; + } + + if (!(cr4 & CR4_PAE_MASK)) { + vcpu_printf(vcpu, "%s: cr4 0x%lx, PAE is not set\n", + __FUNCTION__, cr4); + } + + ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); + + if (ia32_debugctl & IA32_DEBUGCTL_RESERVED_BITS ) { + vcpu_printf(vcpu, "%s: ia32_debugctl 0x%llx, reserve bits\n", + __FUNCTION__, ia32_debugctl); + return 0; + } + + long_mode = is_long_mode(); + + if (long_mode) { + } + + if ( long_mode && !(cr4 & CR4_PAE_MASK)) { + vcpu_printf(vcpu, "%s: long mode and not PAE\n", + __FUNCTION__); + return 0; + } + + cr3 = vmcs_readl(GUEST_CR3); + + if (cr3 & CR3_L_MODE_RESEVED_BITS) { + vcpu_printf(vcpu, "%s: cr3 0x%lx, reserved bits\n", + __FUNCTION__, cr3); + return 0; + } + + if ( !long_mode && (cr4 & CR4_PAE_MASK)) { + /* check the 4 PDPTEs for reserved bits */ + unsigned long pdpt_pfn = cr3 >> PAGE_SHIFT; + int i; + u64 pdpte; + unsigned offset = (cr3 & (PAGE_SIZE-1)) >> 5; + u64 *pdpt = kmap_atomic(pfn_to_page(pdpt_pfn), KM_USER0); + + for (i = 0; i < 4; ++i) { + pdpte = pdpt[offset + i]; + if ((pdpte & 1) && (pdpte & 0xfffffff0000001e6ull)) + break; + } + + kunmap_atomic(pdpt, KM_USER0); + + if (i != 4) { + vcpu_printf(vcpu, "%s: pae cr3[%d] 0x%llx, reserved bits\n", + __FUNCTION__, i, pdpte); + return 0; + } + } + + dr7 = vmcs_readl(GUEST_DR7); + + if (dr7 & ~((1ULL << 32) - 1)) { + vcpu_printf(vcpu, "%s: dr7 0x%lx, reserved bits\n", + __FUNCTION__, dr7); + return 0; + } + + sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP); + + if (!is_canonical(sysenter_esp)) { + vcpu_printf(vcpu, "%s: sysenter_esp 0x%lx, not canonical\n", + __FUNCTION__, sysenter_esp); + return 0; + } + + sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP); + + if (!is_canonical(sysenter_eip)) { + vcpu_printf(vcpu, "%s: sysenter_eip 0x%lx, not canonical\n", + __FUNCTION__, sysenter_eip); + return 0; + } + + rflags = vmcs_readl(GUEST_RFLAGS); + virtual8086 = rflags & RFLAGS_VM; + + + if (vmcs_read16(GUEST_TR_SELECTOR) & SELECTOR_TI_MASK) { + vcpu_printf(vcpu, "%s: tr selctor 0x%x, TI is set\n", + __FUNCTION__, vmcs_read16(GUEST_TR_SELECTOR)); + return 0; + } + + if (!(vmcs_read32(GUEST_LDTR_AR_BYTES) & AR_UNUSABLE_MASK) && + vmcs_read16(GUEST_LDTR_SELECTOR) & SELECTOR_TI_MASK) { + vcpu_printf(vcpu, "%s: ldtr selctor 0x%x," + " is usable and TI is set\n", + __FUNCTION__, vmcs_read16(GUEST_LDTR_SELECTOR)); + return 0; + } + + if (!virtual8086 && + (vmcs_read16(GUEST_SS_SELECTOR) & SELECTOR_RPL_MASK) != + (vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK)) { + vcpu_printf(vcpu, "%s: ss selctor 0x%x cs selctor 0x%x," + " not same RPL\n", + __FUNCTION__, + vmcs_read16(GUEST_SS_SELECTOR), + vmcs_read16(GUEST_CS_SELECTOR)); + return 0; + } + + if (virtual8086) { + VIR8086_SEG_BASE_TEST(CS); + VIR8086_SEG_BASE_TEST(SS); + VIR8086_SEG_BASE_TEST(DS); + VIR8086_SEG_BASE_TEST(ES); + VIR8086_SEG_BASE_TEST(FS); + VIR8086_SEG_BASE_TEST(GS); + } + + if (!is_canonical(vmcs_readl(GUEST_TR_BASE)) || + !is_canonical(vmcs_readl(GUEST_FS_BASE)) || + !is_canonical(vmcs_readl(GUEST_GS_BASE)) ) { + vcpu_printf(vcpu, "%s: TR 0x%lx FS 0x%lx or GS 0x%lx base" + " is not canonical\n", + __FUNCTION__, + vmcs_readl(GUEST_TR_BASE), + vmcs_readl(GUEST_FS_BASE), + vmcs_readl(GUEST_GS_BASE)); + return 0; + + } + + if (!(vmcs_read32(GUEST_LDTR_AR_BYTES) & AR_UNUSABLE_MASK) && + !is_canonical(vmcs_readl(GUEST_LDTR_BASE))) { + vcpu_printf(vcpu, "%s: LDTR base 0x%lx, usable and is not" + " canonical\n", + __FUNCTION__, + vmcs_readl(GUEST_LDTR_BASE)); + return 0; + } + + if ((vmcs_readl(GUEST_CS_BASE) & ~((1ULL << 32) - 1))) { + vcpu_printf(vcpu, "%s: CS base 0x%lx, not all bits 63-32" + " are zero\n", + __FUNCTION__, + vmcs_readl(GUEST_CS_BASE)); + return 0; + } + + #define SEG_BASE_TEST(seg)\ + if ( !(vmcs_read32(GUEST_##seg##_AR_BYTES) & AR_UNUSABLE_MASK) &&\ + (vmcs_readl(GUEST_##seg##_BASE) & ~((1ULL << 32) - 1))) {\ + vcpu_printf(vcpu, "%s: "#seg" base 0x%lx, is usable and not"\ + " all bits 63-32 are zero\n",\ + __FUNCTION__,\ + vmcs_readl(GUEST_##seg##_BASE));\ + return 0;\ + } + SEG_BASE_TEST(SS); + SEG_BASE_TEST(DS); + SEG_BASE_TEST(ES); + + if (virtual8086) { + VIR8086_SEG_LIMIT_TEST(CS); + VIR8086_SEG_LIMIT_TEST(SS); + VIR8086_SEG_LIMIT_TEST(DS); + VIR8086_SEG_LIMIT_TEST(ES); + VIR8086_SEG_LIMIT_TEST(FS); + VIR8086_SEG_LIMIT_TEST(GS); + } + + if (virtual8086) { + VIR8086_SEG_AR_TEST(CS); + VIR8086_SEG_AR_TEST(SS); + VIR8086_SEG_AR_TEST(DS); + VIR8086_SEG_AR_TEST(ES); + VIR8086_SEG_AR_TEST(FS); + VIR8086_SEG_AR_TEST(GS); + } else { + + u32 cs_ar = vmcs_read32(GUEST_CS_AR_BYTES); + u32 ss_ar = vmcs_read32(GUEST_SS_AR_BYTES); + u32 tr_ar = vmcs_read32(GUEST_TR_AR_BYTES); + u32 ldtr_ar = vmcs_read32(GUEST_LDTR_AR_BYTES); + + #define SEG_G_TEST(seg) { \ + u32 lim = vmcs_read32(GUEST_##seg##_LIMIT); \ + u32 ar = vmcs_read32(GUEST_##seg##_AR_BYTES); \ + int err = 0; \ + if (((lim & ~PAGE_MASK) != ~PAGE_MASK) && (ar & AR_G_MASK)) \ + err = 1; \ + if ((lim & ~((1u << 20) - 1)) && !(ar & AR_G_MASK)) \ + err = 1; \ + if (err) { \ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x, G err. lim" \ + " is 0x%x\n", \ + __FUNCTION__, \ + ar, lim); \ + return 0; \ + } \ + } + + + if (!(cs_ar & AR_TYPE_ACCESSES_MASK)) { + vcpu_printf(vcpu, "%s: cs AR 0x%x, accesses is clear\n", + __FUNCTION__, + cs_ar); + return 0; + } + + if (!(cs_ar & AR_TYPE_CODE_MASK)) { + vcpu_printf(vcpu, "%s: cs AR 0x%x, code is clear\n", + __FUNCTION__, + cs_ar); + return 0; + } + + if (!(cs_ar & AR_S_MASK)) { + vcpu_printf(vcpu, "%s: cs AR 0x%x, type is sys\n", + __FUNCTION__, + cs_ar); + return 0; + } + + if ((cs_ar & AR_TYPE_MASK) >= 8 && (cs_ar & AR_TYPE_MASK) < 12 && + AR_DPL(cs_ar) != + (vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK) ) { + vcpu_printf(vcpu, "%s: cs AR 0x%x, " + "DPL(0x%x) not as RPL(0x%x)\n", + __FUNCTION__, + cs_ar, AR_DPL(cs_ar), vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK); + return 0; + } + + if ((cs_ar & AR_TYPE_MASK) >= 13 && (cs_ar & AR_TYPE_MASK) < 16 && + AR_DPL(cs_ar) > + (vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK) ) { + vcpu_printf(vcpu, "%s: cs AR 0x%x, " + "DPL greater than RPL\n", + __FUNCTION__, + cs_ar); + return 0; + } + + if (!(cs_ar & AR_P_MASK)) { + vcpu_printf(vcpu, "%s: CS AR 0x%x, not " + "present\n", + __FUNCTION__, + cs_ar); + return 0; + } + + if ((cs_ar & AR_RESERVD_MASK)) { + vcpu_printf(vcpu, "%s: CS AR 0x%x, reseved" + " bits are set\n", + __FUNCTION__, + cs_ar); + return 0; + } + + if (long_mode & (cs_ar & AR_L_MASK) && (cs_ar & AR_DB_MASK)) { + vcpu_printf(vcpu, "%s: CS AR 0x%x, DB and L are set" + " in long mode\n", + __FUNCTION__, + cs_ar); + return 0; + + } + + SEG_G_TEST(CS); + + if (!(ss_ar & AR_UNUSABLE_MASK)) { + if ((ss_ar & AR_TYPE_MASK) != 3 && + (ss_ar & AR_TYPE_MASK) != 7 ) { + vcpu_printf(vcpu, "%s: ss AR 0x%x, usable and type" + " is not 3 or 7\n", + __FUNCTION__, + ss_ar); + return 0; + } + + if (!(ss_ar & AR_S_MASK)) { + vcpu_printf(vcpu, "%s: ss AR 0x%x, usable and" + " is sys\n", + __FUNCTION__, + ss_ar); + return 0; + } + if (!(ss_ar & AR_P_MASK)) { + vcpu_printf(vcpu, "%s: SS AR 0x%x, usable" + " and not present\n", + __FUNCTION__, + ss_ar); + return 0; + } + + if ((ss_ar & AR_RESERVD_MASK)) { + vcpu_printf(vcpu, "%s: SS AR 0x%x, reseved" + " bits are set\n", + __FUNCTION__, + ss_ar); + return 0; + } + + SEG_G_TEST(SS); + + } + + if (AR_DPL(ss_ar) != + (vmcs_read16(GUEST_SS_SELECTOR) & SELECTOR_RPL_MASK) ) { + vcpu_printf(vcpu, "%s: SS AR 0x%x, " + "DPL not as RPL\n", + __FUNCTION__, + ss_ar); + return 0; + } + + #define SEG_AR_TEST(seg) {\ + u32 ar = vmcs_read32(GUEST_##seg##_AR_BYTES);\ + if (!(ar & AR_UNUSABLE_MASK)) {\ + if (!(ar & AR_TYPE_ACCESSES_MASK)) {\ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x, "\ + "usable and not accesses\n",\ + __FUNCTION__,\ + ar);\ + return 0;\ + }\ + if ((ar & AR_TYPE_CODE_MASK) &&\ + !(ar & AR_TYPE_READABLE_MASK)) {\ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x, "\ + "code and not readable\n",\ + __FUNCTION__,\ + ar);\ + return 0;\ + }\ + if (!(ar & AR_S_MASK)) {\ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x, usable and"\ + " is sys\n",\ + __FUNCTION__,\ + ar);\ + return 0;\ + }\ + if ((ar & AR_TYPE_MASK) >= 0 && \ + (ar & AR_TYPE_MASK) < 12 && \ + AR_DPL(ar) < (vmcs_read16(GUEST_##seg##_SELECTOR) & \ + SELECTOR_RPL_MASK) ) {\ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x, "\ + "DPL less than RPL\n",\ + __FUNCTION__,\ + ar);\ + return 0;\ + }\ + if (!(ar & AR_P_MASK)) {\ + vcpu_printf(vcpu, "%s: "#seg" AR 0x%x, usable and"\ + " not present\n",\ + __FUNCTION__,\ + ar);\ + return 0;\ + }\ + if ((ar & AR_RESERVD_MASK)) {\ + vcpu_printf(vcpu, "%s: "#seg" AR"\ + " 0x%x, reseved"\ + " bits are set\n",\ + __FUNCTION__,\ + ar);\ + return 0;\ + }\ + SEG_G_TEST(seg)\ + }\ + } + +#undef DS +#undef ES +#undef FS +#undef GS + + SEG_AR_TEST(DS); + SEG_AR_TEST(ES); + SEG_AR_TEST(FS); + SEG_AR_TEST(GS); + + // TR test + if (long_mode) { + if ((tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_64_TSS) { + vcpu_printf(vcpu, "%s: TR AR 0x%x, long" + " mode and not 64bit busy" + " tss\n", + __FUNCTION__, + tr_ar); + return 0; + } + } else { + if ((tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_32_TSS && + (tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_16_TSS) { + vcpu_printf(vcpu, "%s: TR AR 0x%x, legacy" + " mode and not 16/32bit " + "busy tss\n", + __FUNCTION__, + tr_ar); + return 0; + } + + } + if ((tr_ar & AR_S_MASK)) { + vcpu_printf(vcpu, "%s: TR AR 0x%x, S is set\n", + __FUNCTION__, + tr_ar); + return 0; + } + if (!(tr_ar & AR_P_MASK)) { + vcpu_printf(vcpu, "%s: TR AR 0x%x, P is not set\n", + __FUNCTION__, + tr_ar); + return 0; + } + + if ((tr_ar & (AR_RESERVD_MASK| AR_UNUSABLE_MASK))) { + vcpu_printf(vcpu, "%s: TR AR 0x%x, reserved bit are" + " set\n", + __FUNCTION__, + tr_ar); + return 0; + } + SEG_G_TEST(TR); + + // TR test + if (!(ldtr_ar & AR_UNUSABLE_MASK)) { + + if ((ldtr_ar & AR_TYPE_MASK) != AR_TYPE_LDT) { + vcpu_printf(vcpu, "%s: LDTR AR 0x%x," + " bad type\n", + __FUNCTION__, + ldtr_ar); + return 0; + } + + if ((ldtr_ar & AR_S_MASK)) { + vcpu_printf(vcpu, "%s: LDTR AR 0x%x," + " S is set\n", + __FUNCTION__, + ldtr_ar); + return 0; + } + + if (!(ldtr_ar & AR_P_MASK)) { + vcpu_printf(vcpu, "%s: LDTR AR 0x%x," + " P is not set\n", + __FUNCTION__, + ldtr_ar); + return 0; + } + if ((ldtr_ar & AR_RESERVD_MASK)) { + vcpu_printf(vcpu, "%s: LDTR AR 0x%x," + " reserved bit are set\n", + __FUNCTION__, + ldtr_ar); + return 0; + } + SEG_G_TEST(LDTR); + } + } + + // GDTR and IDTR + + + #define IDT_GDT_TEST(reg)\ + if (!is_canonical(vmcs_readl(GUEST_##reg##_BASE))) {\ + vcpu_printf(vcpu, "%s: "#reg" BASE 0x%lx, not canonical\n",\ + __FUNCTION__,\ + vmcs_readl(GUEST_##reg##_BASE));\ + return 0;\ + }\ + if (vmcs_read32(GUEST_##reg##_LIMIT) >> 16) {\ + vcpu_printf(vcpu, "%s: "#reg" LIMIT 0x%x, size err\n",\ + __FUNCTION__,\ + vmcs_read32(GUEST_##reg##_LIMIT));\ + return 0;\ + }\ + + IDT_GDT_TEST(GDTR); + IDT_GDT_TEST(IDTR); + + + // RIP + + if ((!long_mode || !(vmcs_read32(GUEST_CS_AR_BYTES) & AR_L_MASK)) && + vmcs_readl(GUEST_RIP) & ~((1ULL << 32) - 1) ){ + vcpu_printf(vcpu, "%s: RIP 0x%lx, size err\n", + __FUNCTION__, + vmcs_readl(GUEST_RIP)); + return 0; + } + + if (!is_canonical(vmcs_readl(GUEST_RIP))) { + vcpu_printf(vcpu, "%s: RIP 0x%lx, not canonical\n", + __FUNCTION__, + vmcs_readl(GUEST_RIP)); + return 0; + } + + // RFLAGS + #define RFLAGS_RESEVED_CLEAR_BITS\ + (~((1ULL << 22) - 1) | (1ULL << 15) | (1ULL << 5) | (1ULL << 3)) + #define RFLAGS_RESEVED_SET_BITS (1 << 1) + + if ((rflags & RFLAGS_RESEVED_CLEAR_BITS) || + !(rflags & RFLAGS_RESEVED_SET_BITS)) { + vcpu_printf(vcpu, "%s: RFLAGS 0x%lx, reserved bits 0x%llx 0x%x\n", + __FUNCTION__, + rflags, + RFLAGS_RESEVED_CLEAR_BITS, + RFLAGS_RESEVED_SET_BITS); + return 0; + } + + if (long_mode && virtual8086) { + vcpu_printf(vcpu, "%s: RFLAGS 0x%lx, vm and long mode\n", + __FUNCTION__, + rflags); + return 0; + } + + + if (!(rflags & RFLAGS_RF)) { + u32 vm_entry_info = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); + if ((vm_entry_info & INTR_INFO_VALID_MASK) && + (vm_entry_info & INTR_INFO_INTR_TYPE_MASK) == + INTR_TYPE_EXT_INTR) { + vcpu_printf(vcpu, "%s: RFLAGS 0x%lx, external" + " interrupt and RF is clear\n", + __FUNCTION__, + rflags); + return 0; + } + + } + + // to be continued from Checks on Guest Non-Register State (22.3.1.5) + return 1; +} + +static int check_fixed_bits(struct kvm_vcpu *vcpu, const char *reg, + unsigned long cr, + u32 msr_fixed_0, u32 msr_fixed_1) +{ + u64 fixed_bits_0, fixed_bits_1; + + rdmsrl(msr_fixed_0, fixed_bits_0); + rdmsrl(msr_fixed_1, fixed_bits_1); + if ((cr & fixed_bits_0) != fixed_bits_0) { + vcpu_printf(vcpu, "%s: %s (%lx) has one of %llx unset\n", + __FUNCTION__, reg, cr, fixed_bits_0); + return 0; + } + if ((~cr & ~fixed_bits_1) != ~fixed_bits_1) { + vcpu_printf(vcpu, "%s: %s (%lx) has one of %llx set\n", + __FUNCTION__, reg, cr, ~fixed_bits_1); + return 0; + } + return 1; +} + +static int phys_addr_width(void) +{ + unsigned eax, ebx, ecx, edx; + + cpuid(0x80000008, &eax, &ebx, &ecx, &edx); + return eax & 0xff; +} + +static int check_canonical(struct kvm_vcpu *vcpu, const char *name, + unsigned long reg) +{ +#ifdef __x86_64__ + unsigned long x; + + if (sizeof(reg) == 4) + return 1; + x = (long)reg >> 48; + if (!(x == 0 || x == ~0UL)) { + vcpu_printf(vcpu, "%s: %s (%lx) not canonical\n", + __FUNCTION__, name, reg); + return 0; + } +#endif + return 1; +} + +static int check_selector(struct kvm_vcpu *vcpu, const char *name, + int rpl_ti, int null, + u16 sel) +{ + if (rpl_ti && (sel & 7)) { + vcpu_printf(vcpu, "%s: %s (%x) nonzero rpl or ti\n", + __FUNCTION__, name, sel); + return 0; + } + if (null && !sel) { + vcpu_printf(vcpu, "%s: %s (%x) zero\n", + __FUNCTION__, name, sel); + return 0; + } + return 1; +} + +#define MSR_IA32_VMX_CR0_FIXED0 0x486 +#define MSR_IA32_VMX_CR0_FIXED1 0x487 + +#define MSR_IA32_VMX_CR4_FIXED0 0x488 +#define MSR_IA32_VMX_CR4_FIXED1 0x489 + +int vm_entry_test_host(struct kvm_vcpu *vcpu) +{ + int r = 0; + unsigned long cr0 = vmcs_readl(HOST_CR0); + unsigned long cr4 = vmcs_readl(HOST_CR4); + unsigned long cr3 = vmcs_readl(HOST_CR3); + int host_64; + + host_64 = vmcs_read32(VM_EXIT_CONTROLS) & VM_EXIT_HOST_ADD_SPACE_SIZE; + + /* 22.2.2 */ + r &= check_fixed_bits(vcpu, "host cr0", cr0, MSR_IA32_VMX_CR0_FIXED0, + MSR_IA32_VMX_CR0_FIXED1); + + r &= check_fixed_bits(vcpu, "host cr0", cr4, MSR_IA32_VMX_CR4_FIXED0, + MSR_IA32_VMX_CR4_FIXED1); + if ((u64)cr3 >> phys_addr_width()) { + vcpu_printf(vcpu, "%s: cr3 (%lx) vs phys addr width\n", + __FUNCTION__, cr3); + r = 0; + } + + r &= check_canonical(vcpu, "host ia32_sysenter_eip", + vmcs_readl(HOST_IA32_SYSENTER_EIP)); + r &= check_canonical(vcpu, "host ia32_sysenter_esp", + vmcs_readl(HOST_IA32_SYSENTER_ESP)); + + /* 22.2.3 */ + r &= check_selector(vcpu, "host cs", 1, 1, + vmcs_read16(HOST_CS_SELECTOR)); + r &= check_selector(vcpu, "host ss", 1, !host_64, + vmcs_read16(HOST_SS_SELECTOR)); + r &= check_selector(vcpu, "host ds", 1, 0, + vmcs_read16(HOST_DS_SELECTOR)); + r &= check_selector(vcpu, "host es", 1, 0, + vmcs_read16(HOST_ES_SELECTOR)); + r &= check_selector(vcpu, "host fs", 1, 0, + vmcs_read16(HOST_FS_SELECTOR)); + r &= check_selector(vcpu, "host gs", 1, 0, + vmcs_read16(HOST_GS_SELECTOR)); + r &= check_selector(vcpu, "host tr", 1, 1, + vmcs_read16(HOST_TR_SELECTOR)); + +#ifdef __x86_64__ + r &= check_canonical(vcpu, "host fs base", + vmcs_readl(HOST_FS_BASE)); + r &= check_canonical(vcpu, "host gs base", + vmcs_readl(HOST_GS_BASE)); + r &= check_canonical(vcpu, "host gdtr base", + vmcs_readl(HOST_GDTR_BASE)); + r &= check_canonical(vcpu, "host idtr base", + vmcs_readl(HOST_IDTR_BASE)); +#endif + + /* 22.2.4 */ +#ifdef __x86_64__ + if (!host_64) { + vcpu_printf(vcpu, "%s: vm exit controls: !64 bit host\n", + __FUNCTION__); + r = 0; + } + if (!(cr4 & CR4_PAE_MASK)) { + vcpu_printf(vcpu, "%s: cr4 (%lx): !pae\n", + __FUNCTION__, cr4); + r = 0; + } + r &= check_canonical(vcpu, "host rip", vmcs_readl(HOST_RIP)); +#endif + + return r; +} + +int vm_entry_test(struct kvm_vcpu *vcpu) +{ + int rg, rh; + + rg = vm_entry_test_guest(vcpu); + rh = vm_entry_test_host(vcpu); + return rg && rh; +} + +void vmcs_dump(struct kvm_vcpu *vcpu) +{ + vcpu_printf(vcpu, "************************ vmcs_dump ************************\n"); + vcpu_printf(vcpu, "VM_ENTRY_CONTROLS 0x%x\n", vmcs_read32(VM_ENTRY_CONTROLS)); + + vcpu_printf(vcpu, "GUEST_CR0 0x%lx\n", vmcs_readl(GUEST_CR0)); + vcpu_printf(vcpu, "GUEST_CR3 0x%lx\n", vmcs_readl(GUEST_CR3)); + vcpu_printf(vcpu, "GUEST_CR4 0x%lx\n", vmcs_readl(GUEST_CR4)); + + vcpu_printf(vcpu, "GUEST_SYSENTER_ESP 0x%lx\n", vmcs_readl(GUEST_SYSENTER_ESP)); + vcpu_printf(vcpu, "GUEST_SYSENTER_EIP 0x%lx\n", vmcs_readl(GUEST_SYSENTER_EIP)); + + + vcpu_printf(vcpu, "GUEST_IA32_DEBUGCTL 0x%llx\n", vmcs_read64(GUEST_IA32_DEBUGCTL)); + vcpu_printf(vcpu, "GUEST_DR7 0x%lx\n", vmcs_readl(GUEST_DR7)); + + vcpu_printf(vcpu, "GUEST_RFLAGS 0x%lx\n", vmcs_readl(GUEST_RFLAGS)); + vcpu_printf(vcpu, "GUEST_RIP 0x%lx\n", vmcs_readl(GUEST_RIP)); + + vcpu_printf(vcpu, "GUEST_CS_SELECTOR 0x%x\n", vmcs_read16(GUEST_CS_SELECTOR)); + vcpu_printf(vcpu, "GUEST_DS_SELECTOR 0x%x\n", vmcs_read16(GUEST_DS_SELECTOR)); + vcpu_printf(vcpu, "GUEST_ES_SELECTOR 0x%x\n", vmcs_read16(GUEST_ES_SELECTOR)); + vcpu_printf(vcpu, "GUEST_FS_SELECTOR 0x%x\n", vmcs_read16(GUEST_FS_SELECTOR)); + vcpu_printf(vcpu, "GUEST_GS_SELECTOR 0x%x\n", vmcs_read16(GUEST_GS_SELECTOR)); + vcpu_printf(vcpu, "GUEST_SS_SELECTOR 0x%x\n", vmcs_read16(GUEST_SS_SELECTOR)); + + vcpu_printf(vcpu, "GUEST_TR_SELECTOR 0x%x\n", vmcs_read16(GUEST_TR_SELECTOR)); + vcpu_printf(vcpu, "GUEST_LDTR_SELECTOR 0x%x\n", vmcs_read16(GUEST_LDTR_SELECTOR)); + + vcpu_printf(vcpu, "GUEST_CS_AR_BYTES 0x%x\n", vmcs_read32(GUEST_CS_AR_BYTES)); + vcpu_printf(vcpu, "GUEST_DS_AR_BYTES 0x%x\n", vmcs_read32(GUEST_DS_AR_BYTES)); + vcpu_printf(vcpu, "GUEST_ES_AR_BYTES 0x%x\n", vmcs_read32(GUEST_ES_AR_BYTES)); + vcpu_printf(vcpu, "GUEST_FS_AR_BYTES 0x%x\n", vmcs_read32(GUEST_FS_AR_BYTES)); + vcpu_printf(vcpu, "GUEST_GS_AR_BYTES 0x%x\n", vmcs_read32(GUEST_GS_AR_BYTES)); + vcpu_printf(vcpu, "GUEST_SS_AR_BYTES 0x%x\n", vmcs_read32(GUEST_SS_AR_BYTES)); + + vcpu_printf(vcpu, "GUEST_LDTR_AR_BYTES 0x%x\n", vmcs_read32(GUEST_LDTR_AR_BYTES)); + vcpu_printf(vcpu, "GUEST_TR_AR_BYTES 0x%x\n", vmcs_read32(GUEST_TR_AR_BYTES)); + + vcpu_printf(vcpu, "GUEST_CS_BASE 0x%lx\n", vmcs_readl(GUEST_CS_BASE)); + vcpu_printf(vcpu, "GUEST_DS_BASE 0x%lx\n", vmcs_readl(GUEST_DS_BASE)); + vcpu_printf(vcpu, "GUEST_ES_BASE 0x%lx\n", vmcs_readl(GUEST_ES_BASE)); + vcpu_printf(vcpu, "GUEST_FS_BASE 0x%lx\n", vmcs_readl(GUEST_FS_BASE)); + vcpu_printf(vcpu, "GUEST_GS_BASE 0x%lx\n", vmcs_readl(GUEST_GS_BASE)); + vcpu_printf(vcpu, "GUEST_SS_BASE 0x%lx\n", vmcs_readl(GUEST_SS_BASE)); + + + vcpu_printf(vcpu, "GUEST_LDTR_BASE 0x%lx\n", vmcs_readl(GUEST_LDTR_BASE)); + vcpu_printf(vcpu, "GUEST_TR_BASE 0x%lx\n", vmcs_readl(GUEST_TR_BASE)); + + vcpu_printf(vcpu, "GUEST_CS_LIMIT 0x%x\n", vmcs_read32(GUEST_CS_LIMIT)); + vcpu_printf(vcpu, "GUEST_DS_LIMIT 0x%x\n", vmcs_read32(GUEST_DS_LIMIT)); + vcpu_printf(vcpu, "GUEST_ES_LIMIT 0x%x\n", vmcs_read32(GUEST_ES_LIMIT)); + vcpu_printf(vcpu, "GUEST_FS_LIMIT 0x%x\n", vmcs_read32(GUEST_FS_LIMIT)); + vcpu_printf(vcpu, "GUEST_GS_LIMIT 0x%x\n", vmcs_read32(GUEST_GS_LIMIT)); + vcpu_printf(vcpu, "GUEST_SS_LIMIT 0x%x\n", vmcs_read32(GUEST_SS_LIMIT)); + + vcpu_printf(vcpu, "GUEST_LDTR_LIMIT 0x%x\n", vmcs_read32(GUEST_LDTR_LIMIT)); + vcpu_printf(vcpu, "GUEST_TR_LIMIT 0x%x\n", vmcs_read32(GUEST_TR_LIMIT)); + + vcpu_printf(vcpu, "GUEST_GDTR_BASE 0x%lx\n", vmcs_readl(GUEST_GDTR_BASE)); + vcpu_printf(vcpu, "GUEST_IDTR_BASE 0x%lx\n", vmcs_readl(GUEST_IDTR_BASE)); + + vcpu_printf(vcpu, "GUEST_GDTR_LIMIT 0x%x\n", vmcs_read32(GUEST_GDTR_LIMIT)); + vcpu_printf(vcpu, "GUEST_IDTR_LIMIT 0x%x\n", vmcs_read32(GUEST_IDTR_LIMIT)); + + vcpu_printf(vcpu, "EXCEPTION_BITMAP 0x%x\n", vmcs_read32(EXCEPTION_BITMAP)); + vcpu_printf(vcpu, "***********************************************************\n"); +} + +void regs_dump(struct kvm_vcpu *vcpu) +{ + #define REG_DUMP(reg) \ + vcpu_printf(vcpu, #reg" = 0x%lx(VCPU)\n", vcpu->regs[VCPU_REGS_##reg]) + #define VMCS_REG_DUMP(reg) \ + vcpu_printf(vcpu, #reg" = 0x%lx(VMCS)\n", vmcs_readl(GUEST_##reg)) + + vcpu_printf(vcpu, "************************ regs_dump ************************\n"); + REG_DUMP(RAX); + REG_DUMP(RBX); + REG_DUMP(RCX); + REG_DUMP(RDX); + REG_DUMP(RSP); + REG_DUMP(RBP); + REG_DUMP(RSI); + REG_DUMP(RDI); + REG_DUMP(R8); + REG_DUMP(R9); + REG_DUMP(R10); + REG_DUMP(R11); + REG_DUMP(R12); + REG_DUMP(R13); + REG_DUMP(R14); + REG_DUMP(R15); + + VMCS_REG_DUMP(RSP); + VMCS_REG_DUMP(RIP); + VMCS_REG_DUMP(RFLAGS); + + vcpu_printf(vcpu, "***********************************************************\n"); +} + +void sregs_dump(struct kvm_vcpu *vcpu) +{ + vcpu_printf(vcpu, "************************ sregs_dump ************************\n"); + vcpu_printf(vcpu, "cr0 = 0x%lx\n", guest_cr0()); + vcpu_printf(vcpu, "cr2 = 0x%lx\n", vcpu->cr2); + vcpu_printf(vcpu, "cr3 = 0x%lx\n", vcpu->cr3); + vcpu_printf(vcpu, "cr4 = 0x%lx\n", guest_cr4()); + vcpu_printf(vcpu, "cr8 = 0x%lx\n", vcpu->cr8); + vcpu_printf(vcpu, "shadow_efer = 0x%llx\n", vcpu->shadow_efer); + vcpu_printf(vcpu, "***********************************************************\n"); +} + +void show_pending_interrupts(struct kvm_vcpu *vcpu) +{ + int i; + vcpu_printf(vcpu, "************************ pending interrupts ****************\n"); + vcpu_printf(vcpu, "sumamry = 0x%lx\n", vcpu->irq_summary); + for (i=0 ; i < NR_IRQ_WORDS ; i++) + vcpu_printf(vcpu, "%lx ", vcpu->irq_pending[i]); + vcpu_printf(vcpu, "\n"); + vcpu_printf(vcpu, "************************************************************\n"); +} + +void vcpu_dump(struct kvm_vcpu *vcpu) +{ + regs_dump(vcpu); + sregs_dump(vcpu); + vmcs_dump(vcpu); + show_msrs(vcpu); + show_pending_interrupts(vcpu); + /* more ... */ +} +#endif + Index: linux/drivers/kvm/debug.h =================================================================== --- /dev/null +++ linux/drivers/kvm/debug.h @@ -0,0 +1,23 @@ +#ifndef __KVM_DEBUG_H +#define __KVM_DEBUG_H + +#ifdef KVM_DEBUG + +void show_msrs(struct kvm_vcpu *vcpu); + + +void show_irq(struct kvm_vcpu *vcpu, int irq); +void show_page(struct kvm_vcpu *vcpu, gva_t addr); +void show_u64(struct kvm_vcpu *vcpu, gva_t addr); +void show_code(struct kvm_vcpu *vcpu); +int vm_entry_test(struct kvm_vcpu *vcpu); + +void vmcs_dump(struct kvm_vcpu *vcpu); +void regs_dump(struct kvm_vcpu *vcpu); +void sregs_dump(struct kvm_vcpu *vcpu); +void show_pending_interrupts(struct kvm_vcpu *vcpu); +void vcpu_dump(struct kvm_vcpu *vcpu); + +#endif + +#endif Index: linux/drivers/kvm/kvm.h =================================================================== --- /dev/null +++ linux/drivers/kvm/kvm.h @@ -0,0 +1,551 @@ +#ifndef __KVM_H +#define __KVM_H + +/* + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include + +#include "vmx.h" +#include + +#define CR0_PE_MASK (1ULL << 0) +#define CR0_TS_MASK (1ULL << 3) +#define CR0_NE_MASK (1ULL << 5) +#define CR0_WP_MASK (1ULL << 16) +#define CR0_NW_MASK (1ULL << 29) +#define CR0_CD_MASK (1ULL << 30) +#define CR0_PG_MASK (1ULL << 31) + +#define CR3_WPT_MASK (1ULL << 3) +#define CR3_PCD_MASK (1ULL << 4) + +#define CR3_RESEVED_BITS 0x07ULL +#define CR3_L_MODE_RESEVED_BITS (~((1ULL << 40) - 1) | 0x0fe7ULL) +#define CR3_FLAGS_MASK ((1ULL << 5) - 1) + +#define CR4_VME_MASK (1ULL << 0) +#define CR4_PSE_MASK (1ULL << 4) +#define CR4_PAE_MASK (1ULL << 5) +#define CR4_PGE_MASK (1ULL << 7) +#define CR4_VMXE_MASK (1ULL << 13) + +#define KVM_GUEST_CR0_MASK \ + (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK \ + | CR0_NW_MASK | CR0_CD_MASK) +#define KVM_VM_CR0_ALWAYS_ON \ + (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK) +#define KVM_GUEST_CR4_MASK \ + (CR4_PSE_MASK | CR4_PAE_MASK | CR4_PGE_MASK | CR4_VMXE_MASK | CR4_VME_MASK) +#define KVM_PMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK) +#define KVM_RMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK | CR4_VME_MASK) + +#define INVALID_PAGE (~(hpa_t)0) +#define UNMAPPED_GVA (~(gpa_t)0) + +#define KVM_MAX_VCPUS 1 +#define KVM_MEMORY_SLOTS 4 +#define KVM_NUM_MMU_PAGES 256 + +#define FX_IMAGE_SIZE 512 +#define FX_IMAGE_ALIGN 16 +#define FX_BUF_SIZE (2 * FX_IMAGE_SIZE + FX_IMAGE_ALIGN) + +#define DE_VECTOR 0 +#define DF_VECTOR 8 +#define TS_VECTOR 10 +#define NP_VECTOR 11 +#define SS_VECTOR 12 +#define GP_VECTOR 13 +#define PF_VECTOR 14 + +#define SELECTOR_TI_MASK (1 << 2) +#define SELECTOR_RPL_MASK 0x03 + +#define IOPL_SHIFT 12 + +/* + * Address types: + * + * gva - guest virtual address + * gpa - guest physical address + * gfn - guest frame number + * hva - host virtual address + * hpa - host physical address + * hfn - host frame number + */ + +typedef unsigned long gva_t; +typedef u64 gpa_t; +typedef unsigned long gfn_t; + +typedef unsigned long hva_t; +typedef u64 hpa_t; +typedef unsigned long hfn_t; + +struct kvm_mmu_page { + struct list_head link; + hpa_t page_hpa; + unsigned long slot_bitmap; /* One bit set per slot which has memory + * in this shadow page. + */ + int global; /* Set if all ptes in this page are global */ + u64 *parent_pte; +}; + +struct vmcs { + u32 revision_id; + u32 abort; + char data[0]; +}; + +#define vmx_msr_entry kvm_msr_entry + +struct kvm_vcpu; + +/* + * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level + * 32-bit). The kvm_mmu structure abstracts the details of the current mmu + * mode. + */ +struct kvm_mmu { + void (*new_cr3)(struct kvm_vcpu *vcpu); + int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err); + void (*inval_page)(struct kvm_vcpu *vcpu, gva_t gva); + void (*free)(struct kvm_vcpu *vcpu); + gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva); + hpa_t root_hpa; + int root_level; + int shadow_root_level; +}; + +struct kvm_guest_debug { + int enabled; + unsigned long bp[4]; + int singlestep; +}; + +enum { + VCPU_REGS_RAX = 0, + VCPU_REGS_RCX = 1, + VCPU_REGS_RDX = 2, + VCPU_REGS_RBX = 3, + VCPU_REGS_RSP = 4, + VCPU_REGS_RBP = 5, + VCPU_REGS_RSI = 6, + VCPU_REGS_RDI = 7, +#ifdef __x86_64__ + VCPU_REGS_R8 = 8, + VCPU_REGS_R9 = 9, + VCPU_REGS_R10 = 10, + VCPU_REGS_R11 = 11, + VCPU_REGS_R12 = 12, + VCPU_REGS_R13 = 13, + VCPU_REGS_R14 = 14, + VCPU_REGS_R15 = 15, +#endif + NR_VCPU_REGS +}; + +enum { + VCPU_SREG_CS, + VCPU_SREG_DS, + VCPU_SREG_ES, + VCPU_SREG_FS, + VCPU_SREG_GS, + VCPU_SREG_SS, + VCPU_SREG_TR, + VCPU_SREG_LDTR, +}; + +struct kvm_vcpu { + struct kvm *kvm; + union { + struct vmcs *vmcs; + struct vcpu_svm *svm; + }; + struct mutex mutex; + int cpu; + int launched; + unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */ +#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long) + unsigned long irq_pending[NR_IRQ_WORDS]; + unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */ + unsigned long rip; /* needs vcpu_load_rsp_rip() */ + + unsigned long cr0; + unsigned long cr2; + unsigned long cr3; + unsigned long cr4; + unsigned long cr8; + u64 shadow_efer; + u64 apic_base; + int nmsrs; + struct vmx_msr_entry *guest_msrs; + struct vmx_msr_entry *host_msrs; + + struct list_head free_pages; + struct kvm_mmu_page page_header_buf[KVM_NUM_MMU_PAGES]; + struct kvm_mmu mmu; + + struct kvm_guest_debug guest_debug; + + char fx_buf[FX_BUF_SIZE]; + char *host_fx_image; + char *guest_fx_image; + + int mmio_needed; + int mmio_read_completed; + int mmio_is_write; + int mmio_size; + unsigned char mmio_data[8]; + gpa_t mmio_phys_addr; + + struct { + int active; + u8 save_iopl; + struct kvm_save_segment { + u16 selector; + unsigned long base; + u32 limit; + u32 ar; + } tr, es, ds, fs, gs; + } rmode; +}; + +struct kvm_memory_slot { + gfn_t base_gfn; + unsigned long npages; + unsigned long flags; + struct page **phys_mem; + unsigned long *dirty_bitmap; +}; + +struct kvm { + spinlock_t lock; /* protects everything except vcpus */ + int nmemslots; + struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS]; + struct list_head active_mmu_pages; + struct kvm_vcpu vcpus[KVM_MAX_VCPUS]; + int memory_config_version; + int busy; +}; + +struct kvm_stat { + u32 pf_fixed; + u32 pf_guest; + u32 tlb_flush; + u32 invlpg; + + u32 exits; + u32 io_exits; + u32 mmio_exits; + u32 signal_exits; + u32 irq_exits; +}; + +struct descriptor_table { + u16 limit; + unsigned long base; +} __attribute__((packed)); + +struct kvm_arch_ops { + int (*cpu_has_kvm_support)(void); /* __init */ + int (*disabled_by_bios)(void); /* __init */ + void (*hardware_enable)(void *dummy); /* __init */ + void (*hardware_disable)(void *dummy); + int (*hardware_setup)(void); /* __init */ + void (*hardware_unsetup)(void); /* __exit */ + + int (*vcpu_create)(struct kvm_vcpu *vcpu); + void (*vcpu_free)(struct kvm_vcpu *vcpu); + + struct kvm_vcpu *(*vcpu_load)(struct kvm_vcpu *vcpu); + void (*vcpu_put)(struct kvm_vcpu *vcpu); + + int (*set_guest_debug)(struct kvm_vcpu *vcpu, + struct kvm_debug_guest *dbg); + int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); + int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); + void (*get_segment)(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg); + void (*set_segment)(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg); + int (*is_long_mode)(struct kvm_vcpu *vcpu); + void (*get_cs_db_l_bits)(struct kvm_vcpu *vcpu, int *db, int *l); + void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0); + void (*set_cr0_no_modeswitch)(struct kvm_vcpu *vcpu, + unsigned long cr0); + void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); + void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4); + void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer); + void (*get_idt)(struct kvm_vcpu *vcpu, struct descriptor_table *dt); + void (*set_idt)(struct kvm_vcpu *vcpu, struct descriptor_table *dt); + void (*get_gdt)(struct kvm_vcpu *vcpu, struct descriptor_table *dt); + void (*set_gdt)(struct kvm_vcpu *vcpu, struct descriptor_table *dt); + unsigned long (*get_dr)(struct kvm_vcpu *vcpu, int dr); + void (*set_dr)(struct kvm_vcpu *vcpu, int dr, unsigned long value, + int *exception); + void (*cache_regs)(struct kvm_vcpu *vcpu); + void (*decache_regs)(struct kvm_vcpu *vcpu); + unsigned long (*get_rflags)(struct kvm_vcpu *vcpu); + void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags); + + void (*invlpg)(struct kvm_vcpu *vcpu, gva_t addr); + void (*tlb_flush)(struct kvm_vcpu *vcpu); + void (*inject_page_fault)(struct kvm_vcpu *vcpu, + unsigned long addr, u32 err_code); + + void (*inject_gp)(struct kvm_vcpu *vcpu, unsigned err_code); + + int (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run); + int (*vcpu_setup)(struct kvm_vcpu *vcpu); + void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu); +}; + +extern struct kvm_stat kvm_stat; +extern struct kvm_arch_ops *kvm_arch_ops; + +#define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt) +#define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt) + +int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module); +void kvm_exit_arch(void); + +void kvm_mmu_destroy(struct kvm_vcpu *vcpu); +int kvm_mmu_init(struct kvm_vcpu *vcpu); + +int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); +void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); + +hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa); +#define HPA_MSB ((sizeof(hpa_t) * 8) - 1) +#define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB) +static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; } +hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva); + +void kvm_emulator_want_group7_invlpg(void); + +extern hpa_t bad_page_address; + +static inline struct page *gfn_to_page(struct kvm_memory_slot *slot, gfn_t gfn) +{ + return slot->phys_mem[gfn - slot->base_gfn]; +} + +struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); +void mark_page_dirty(struct kvm *kvm, gfn_t gfn); + +enum emulation_result { + EMULATE_DONE, /* no further processing */ + EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ + EMULATE_FAIL, /* can't emulate this instruction */ +}; + +int emulate_instruction(struct kvm_vcpu *vcpu, struct kvm_run *run, + unsigned long cr2, u16 error_code); +void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); +void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); +void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, + unsigned long *rflags); + +unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr); +void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value, + unsigned long *rflags); + +struct x86_emulate_ctxt; + +int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address); +int emulate_clts(struct kvm_vcpu *vcpu); +int emulator_get_dr(struct x86_emulate_ctxt* ctxt, int dr, + unsigned long *dest); +int emulator_set_dr(struct x86_emulate_ctxt *ctxt, int dr, + unsigned long value); + +void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0); +void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr0); +void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr0); +void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr0); +void lmsw(struct kvm_vcpu *vcpu, unsigned long msw); + +#ifdef __x86_64__ +void set_efer(struct kvm_vcpu *vcpu, u64 efer); +#endif + +void fx_init(struct kvm_vcpu *vcpu); + +void load_msrs(struct vmx_msr_entry *e, int n); +void save_msrs(struct vmx_msr_entry *e, int n); +void kvm_resched(struct kvm_vcpu *vcpu); + +int kvm_read_guest(struct kvm_vcpu *vcpu, + gva_t addr, + unsigned long size, + void *dest); + +int kvm_write_guest(struct kvm_vcpu *vcpu, + gva_t addr, + unsigned long size, + void *data); + +unsigned long segment_base(u16 selector); + +static inline struct page *_gfn_to_page(struct kvm *kvm, gfn_t gfn) +{ + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn); + return (slot) ? slot->phys_mem[gfn - slot->base_gfn] : 0; +} + +static inline int is_pae(struct kvm_vcpu *vcpu) +{ + return vcpu->cr4 & CR4_PAE_MASK; +} + +static inline int is_pse(struct kvm_vcpu *vcpu) +{ + return vcpu->cr4 & CR4_PSE_MASK; +} + +static inline int is_paging(struct kvm_vcpu *vcpu) +{ + return vcpu->cr0 & CR0_PG_MASK; +} + +static inline int memslot_id(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + return slot - kvm->memslots; +} + +static inline struct kvm_mmu_page *page_header(hpa_t shadow_page) +{ + struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT); + + return (struct kvm_mmu_page *)page->private; +} + +static inline u16 read_fs(void) +{ + u16 seg; + asm ("mov %%fs, %0" : "=g"(seg)); + return seg; +} + +static inline u16 read_gs(void) +{ + u16 seg; + asm ("mov %%gs, %0" : "=g"(seg)); + return seg; +} + +static inline u16 read_ldt(void) +{ + u16 ldt; + asm ("sldt %0" : "=g"(ldt)); + return ldt; +} + +static inline void load_fs(u16 sel) +{ + asm ("mov %0, %%fs" : : "rm"(sel)); +} + +static inline void load_gs(u16 sel) +{ + asm ("mov %0, %%gs" : : "rm"(sel)); +} + +#ifndef load_ldt +static inline void load_ldt(u16 sel) +{ + asm ("lldt %0" : : "g"(sel)); +} +#endif + +static inline void get_idt(struct descriptor_table *table) +{ + asm ("sidt %0" : "=m"(*table)); +} + +static inline void get_gdt(struct descriptor_table *table) +{ + asm ("sgdt %0" : "=m"(*table)); +} + +static inline unsigned long read_tr_base(void) +{ + u16 tr; + asm ("str %0" : "=g"(tr)); + return segment_base(tr); +} + +#ifdef __x86_64__ +static inline unsigned long read_msr(unsigned long msr) +{ + u64 value; + + rdmsrl(msr, value); + return value; +} +#endif + +static inline void fx_save(void *image) +{ + asm ("fxsave (%0)":: "r" (image)); +} + +static inline void fx_restore(void *image) +{ + asm ("fxrstor (%0)":: "r" (image)); +} + +static inline void fpu_init(void) +{ + asm ("finit"); +} + +static inline u32 get_rdx_init_val(void) +{ + return 0x600; /* P6 family */ +} + +#define ASM_VMX_VMCLEAR_RAX ".byte 0x66, 0x0f, 0xc7, 0x30" +#define ASM_VMX_VMLAUNCH ".byte 0x0f, 0x01, 0xc2" +#define ASM_VMX_VMRESUME ".byte 0x0f, 0x01, 0xc3" +#define ASM_VMX_VMPTRLD_RAX ".byte 0x0f, 0xc7, 0x30" +#define ASM_VMX_VMREAD_RDX_RAX ".byte 0x0f, 0x78, 0xd0" +#define ASM_VMX_VMWRITE_RAX_RDX ".byte 0x0f, 0x79, 0xd0" +#define ASM_VMX_VMWRITE_RSP_RDX ".byte 0x0f, 0x79, 0xd4" +#define ASM_VMX_VMXOFF ".byte 0x0f, 0x01, 0xc4" +#define ASM_VMX_VMXON_RAX ".byte 0xf3, 0x0f, 0xc7, 0x30" + +#define MSR_IA32_TIME_STAMP_COUNTER 0x010 + +#define TSS_IOPB_BASE_OFFSET 0x66 +#define TSS_BASE_SIZE 0x68 +#define TSS_IOPB_SIZE (65536 / 8) +#define TSS_REDIRECTION_SIZE (256 / 8) +#define RMODE_TSS_SIZE (TSS_BASE_SIZE + TSS_REDIRECTION_SIZE + TSS_IOPB_SIZE + 1) + +#ifdef __x86_64__ + +/* + * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64. Therefore + * we need to allocate shadow page tables in the first 4GB of memory, which + * happens to fit the DMA32 zone. + */ +#define GFP_KVM_MMU (GFP_KERNEL | __GFP_DMA32) + +#else + +#define GFP_KVM_MMU GFP_KERNEL + +#endif + +#endif Index: linux/drivers/kvm/kvm_main.c =================================================================== --- /dev/null +++ linux/drivers/kvm/kvm_main.c @@ -0,0 +1,1914 @@ +/* + * Kernel-based Virtual Machine driver for Linux + * + * This module enables machines with Intel VT-x extensions to run virtual + * machines without emulation or binary translation. + * + * Copyright (C) 2006 Qumranet, Inc. + * + * Authors: + * Avi Kivity + * Yaniv Kamay + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include "kvm.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "x86_emulate.h" +#include "segment_descriptor.h" + +MODULE_AUTHOR("Qumranet"); +MODULE_LICENSE("GPL"); + +struct kvm_arch_ops *kvm_arch_ops; +struct kvm_stat kvm_stat; +EXPORT_SYMBOL_GPL(kvm_stat); + +static struct kvm_stats_debugfs_item { + const char *name; + u32 *data; + struct dentry *dentry; +} debugfs_entries[] = { + { "pf_fixed", &kvm_stat.pf_fixed }, + { "pf_guest", &kvm_stat.pf_guest }, + { "tlb_flush", &kvm_stat.tlb_flush }, + { "invlpg", &kvm_stat.invlpg }, + { "exits", &kvm_stat.exits }, + { "io_exits", &kvm_stat.io_exits }, + { "mmio_exits", &kvm_stat.mmio_exits }, + { "signal_exits", &kvm_stat.signal_exits }, + { "irq_exits", &kvm_stat.irq_exits }, + { 0, 0 } +}; + +static struct dentry *debugfs_dir; + +#define MAX_IO_MSRS 256 + +#define CR0_RESEVED_BITS 0xffffffff1ffaffc0ULL +#define LMSW_GUEST_MASK 0x0eULL +#define CR4_RESEVED_BITS (~((1ULL << 11) - 1)) +#define CR8_RESEVED_BITS (~0x0fULL) +#define EFER_RESERVED_BITS 0xfffffffffffff2fe + +#ifdef __x86_64__ +// LDT or TSS descriptor in the GDT. 16 bytes. +struct segment_descriptor_64 { + struct segment_descriptor s; + u32 base_higher; + u32 pad_zero; +}; + +#endif + +unsigned long segment_base(u16 selector) +{ + struct descriptor_table gdt; + struct segment_descriptor *d; + unsigned long table_base; + typedef unsigned long ul; + unsigned long v; + + asm ("sgdt %0" : "=m"(gdt)); + table_base = gdt.base; + + if (selector & 4) { /* from ldt */ + u16 ldt_selector; + + asm ("sldt %0" : "=g"(ldt_selector)); + table_base = segment_base(ldt_selector); + } + d = (struct segment_descriptor *)(table_base + (selector & ~7)); + v = d->base_low | ((ul)d->base_mid << 16) | ((ul)d->base_high << 24); +#ifdef __x86_64__ + if (d->system == 0 + && (d->type == 2 || d->type == 9 || d->type == 11)) + v |= ((ul)((struct segment_descriptor_64 *)d)->base_higher) << 32; +#endif + return v; +} +EXPORT_SYMBOL_GPL(segment_base); + +int kvm_read_guest(struct kvm_vcpu *vcpu, + gva_t addr, + unsigned long size, + void *dest) +{ + unsigned char *host_buf = dest; + unsigned long req_size = size; + + while (size) { + hpa_t paddr; + unsigned now; + unsigned offset; + hva_t guest_buf; + + paddr = gva_to_hpa(vcpu, addr); + + if (is_error_hpa(paddr)) + break; + + guest_buf = (hva_t)kmap_atomic( + pfn_to_page(paddr >> PAGE_SHIFT), + KM_USER0); + offset = addr & ~PAGE_MASK; + guest_buf |= offset; + now = min(size, PAGE_SIZE - offset); + memcpy(host_buf, (void*)guest_buf, now); + host_buf += now; + addr += now; + size -= now; + kunmap_atomic((void *)(guest_buf & PAGE_MASK), KM_USER0); + } + return req_size - size; +} +EXPORT_SYMBOL_GPL(kvm_read_guest); + +int kvm_write_guest(struct kvm_vcpu *vcpu, + gva_t addr, + unsigned long size, + void *data) +{ + unsigned char *host_buf = data; + unsigned long req_size = size; + + while (size) { + hpa_t paddr; + unsigned now; + unsigned offset; + hva_t guest_buf; + + paddr = gva_to_hpa(vcpu, addr); + + if (is_error_hpa(paddr)) + break; + + guest_buf = (hva_t)kmap_atomic( + pfn_to_page(paddr >> PAGE_SHIFT), KM_USER0); + offset = addr & ~PAGE_MASK; + guest_buf |= offset; + now = min(size, PAGE_SIZE - offset); + memcpy((void*)guest_buf, host_buf, now); + host_buf += now; + addr += now; + size -= now; + kunmap_atomic((void *)(guest_buf & PAGE_MASK), KM_USER0); + } + return req_size - size; +} +EXPORT_SYMBOL_GPL(kvm_write_guest); + +static int vcpu_slot(struct kvm_vcpu *vcpu) +{ + return vcpu - vcpu->kvm->vcpus; +} + +/* + * Switches to specified vcpu, until a matching vcpu_put() + */ +static struct kvm_vcpu *vcpu_load(struct kvm *kvm, int vcpu_slot) +{ + struct kvm_vcpu *vcpu = &kvm->vcpus[vcpu_slot]; + + mutex_lock(&vcpu->mutex); + if (unlikely(!vcpu->vmcs)) { + mutex_unlock(&vcpu->mutex); + return 0; + } + return kvm_arch_ops->vcpu_load(vcpu); +} + +static void vcpu_put(struct kvm_vcpu *vcpu) +{ + kvm_arch_ops->vcpu_put(vcpu); + put_cpu(); + mutex_unlock(&vcpu->mutex); +} + +static int kvm_dev_open(struct inode *inode, struct file *filp) +{ + struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int i; + + if (!kvm) + return -ENOMEM; + + spin_lock_init(&kvm->lock); + INIT_LIST_HEAD(&kvm->active_mmu_pages); + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + struct kvm_vcpu *vcpu = &kvm->vcpus[i]; + + mutex_init(&vcpu->mutex); + vcpu->mmu.root_hpa = INVALID_PAGE; + INIT_LIST_HEAD(&vcpu->free_pages); + } + filp->private_data = kvm; + return 0; +} + +/* + * Free any memory in @free but not in @dont. + */ +static void kvm_free_physmem_slot(struct kvm_memory_slot *free, + struct kvm_memory_slot *dont) +{ + int i; + + if (!dont || free->phys_mem != dont->phys_mem) + if (free->phys_mem) { + for (i = 0; i < free->npages; ++i) + __free_page(free->phys_mem[i]); + vfree(free->phys_mem); + } + + if (!dont || free->dirty_bitmap != dont->dirty_bitmap) + vfree(free->dirty_bitmap); + + free->phys_mem = 0; + free->npages = 0; + free->dirty_bitmap = 0; +} + +static void kvm_free_physmem(struct kvm *kvm) +{ + int i; + + for (i = 0; i < kvm->nmemslots; ++i) + kvm_free_physmem_slot(&kvm->memslots[i], 0); +} + +static void kvm_free_vcpu(struct kvm_vcpu *vcpu) +{ + kvm_arch_ops->vcpu_free(vcpu); + kvm_mmu_destroy(vcpu); +} + +static void kvm_free_vcpus(struct kvm *kvm) +{ + unsigned int i; + + for (i = 0; i < KVM_MAX_VCPUS; ++i) + kvm_free_vcpu(&kvm->vcpus[i]); +} + +static int kvm_dev_release(struct inode *inode, struct file *filp) +{ + struct kvm *kvm = filp->private_data; + + kvm_free_vcpus(kvm); + kvm_free_physmem(kvm); + kfree(kvm); + return 0; +} + +static void inject_gp(struct kvm_vcpu *vcpu) +{ + kvm_arch_ops->inject_gp(vcpu, 0); +} + +static int pdptrs_have_reserved_bits_set(struct kvm_vcpu *vcpu, + unsigned long cr3) +{ + gfn_t pdpt_gfn = cr3 >> PAGE_SHIFT; + unsigned offset = (cr3 & (PAGE_SIZE-1)) >> 5; + int i; + u64 pdpte; + u64 *pdpt; + struct kvm_memory_slot *memslot; + + spin_lock(&vcpu->kvm->lock); + memslot = gfn_to_memslot(vcpu->kvm, pdpt_gfn); + /* FIXME: !memslot - emulate? 0xff? */ + pdpt = kmap_atomic(gfn_to_page(memslot, pdpt_gfn), KM_USER0); + + for (i = 0; i < 4; ++i) { + pdpte = pdpt[offset + i]; + if ((pdpte & 1) && (pdpte & 0xfffffff0000001e6ull)) + break; + } + + kunmap_atomic(pdpt, KM_USER0); + spin_unlock(&vcpu->kvm->lock); + + return i != 4; +} + +void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) +{ + if (cr0 & CR0_RESEVED_BITS) { + printk(KERN_DEBUG "set_cr0: 0x%lx #GP, reserved bits 0x%lx\n", + cr0, vcpu->cr0); + inject_gp(vcpu); + return; + } + + if ((cr0 & CR0_NW_MASK) && !(cr0 & CR0_CD_MASK)) { + printk(KERN_DEBUG "set_cr0: #GP, CD == 0 && NW == 1\n"); + inject_gp(vcpu); + return; + } + + if ((cr0 & CR0_PG_MASK) && !(cr0 & CR0_PE_MASK)) { + printk(KERN_DEBUG "set_cr0: #GP, set PG flag " + "and a clear PE flag\n"); + inject_gp(vcpu); + return; + } + + if (!is_paging(vcpu) && (cr0 & CR0_PG_MASK)) { +#ifdef __x86_64__ + if ((vcpu->shadow_efer & EFER_LME)) { + int cs_db, cs_l; + + if (!is_pae(vcpu)) { + printk(KERN_DEBUG "set_cr0: #GP, start paging " + "in long mode while PAE is disabled\n"); + inject_gp(vcpu); + return; + } + kvm_arch_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + if (cs_l) { + printk(KERN_DEBUG "set_cr0: #GP, start paging " + "in long mode while CS.L == 1\n"); + inject_gp(vcpu); + return; + + } + } else +#endif + if (is_pae(vcpu) && + pdptrs_have_reserved_bits_set(vcpu, vcpu->cr3)) { + printk(KERN_DEBUG "set_cr0: #GP, pdptrs " + "reserved bits\n"); + inject_gp(vcpu); + return; + } + + } + + kvm_arch_ops->set_cr0(vcpu, cr0); + vcpu->cr0 = cr0; + + spin_lock(&vcpu->kvm->lock); + kvm_mmu_reset_context(vcpu); + spin_unlock(&vcpu->kvm->lock); + return; +} +EXPORT_SYMBOL_GPL(set_cr0); + +void lmsw(struct kvm_vcpu *vcpu, unsigned long msw) +{ + set_cr0(vcpu, (vcpu->cr0 & ~0x0ful) | (msw & 0x0f)); +} +EXPORT_SYMBOL_GPL(lmsw); + +void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +{ + if (cr4 & CR4_RESEVED_BITS) { + printk(KERN_DEBUG "set_cr4: #GP, reserved bits\n"); + inject_gp(vcpu); + return; + } + + if (kvm_arch_ops->is_long_mode(vcpu)) { + if (!(cr4 & CR4_PAE_MASK)) { + printk(KERN_DEBUG "set_cr4: #GP, clearing PAE while " + "in long mode\n"); + inject_gp(vcpu); + return; + } + } else if (is_paging(vcpu) && !is_pae(vcpu) && (cr4 & CR4_PAE_MASK) + && pdptrs_have_reserved_bits_set(vcpu, vcpu->cr3)) { + printk(KERN_DEBUG "set_cr4: #GP, pdptrs reserved bits\n"); + inject_gp(vcpu); + } + + if (cr4 & CR4_VMXE_MASK) { + printk(KERN_DEBUG "set_cr4: #GP, setting VMXE\n"); + inject_gp(vcpu); + return; + } + kvm_arch_ops->set_cr4(vcpu, cr4); + spin_lock(&vcpu->kvm->lock); + kvm_mmu_reset_context(vcpu); + spin_unlock(&vcpu->kvm->lock); +} +EXPORT_SYMBOL_GPL(set_cr4); + +void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) +{ + if (kvm_arch_ops->is_long_mode(vcpu)) { + if ( cr3 & CR3_L_MODE_RESEVED_BITS) { + printk(KERN_DEBUG "set_cr3: #GP, reserved bits\n"); + inject_gp(vcpu); + return; + } + } else { + if (cr3 & CR3_RESEVED_BITS) { + printk(KERN_DEBUG "set_cr3: #GP, reserved bits\n"); + inject_gp(vcpu); + return; + } + if (is_paging(vcpu) && is_pae(vcpu) && + pdptrs_have_reserved_bits_set(vcpu, cr3)) { + printk(KERN_DEBUG "set_cr3: #GP, pdptrs " + "reserved bits\n"); + inject_gp(vcpu); + return; + } + } + + vcpu->cr3 = cr3; + spin_lock(&vcpu->kvm->lock); + vcpu->mmu.new_cr3(vcpu); + spin_unlock(&vcpu->kvm->lock); +} +EXPORT_SYMBOL_GPL(set_cr3); + +void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8) +{ + if ( cr8 & CR8_RESEVED_BITS) { + printk(KERN_DEBUG "set_cr8: #GP, reserved bits 0x%lx\n", cr8); + inject_gp(vcpu); + return; + } + vcpu->cr8 = cr8; +} +EXPORT_SYMBOL_GPL(set_cr8); + +void fx_init(struct kvm_vcpu *vcpu) +{ + struct __attribute__ ((__packed__)) fx_image_s { + u16 control; //fcw + u16 status; //fsw + u16 tag; // ftw + u16 opcode; //fop + u64 ip; // fpu ip + u64 operand;// fpu dp + u32 mxcsr; + u32 mxcsr_mask; + + } *fx_image; + + fx_save(vcpu->host_fx_image); + fpu_init(); + fx_save(vcpu->guest_fx_image); + fx_restore(vcpu->host_fx_image); + + fx_image = (struct fx_image_s *)vcpu->guest_fx_image; + fx_image->mxcsr = 0x1f80; + memset(vcpu->guest_fx_image + sizeof(struct fx_image_s), + 0, FX_IMAGE_SIZE - sizeof(struct fx_image_s)); +} +EXPORT_SYMBOL_GPL(fx_init); + +/* + * Creates some virtual cpus. Good luck creating more than one. + */ +static int kvm_dev_ioctl_create_vcpu(struct kvm *kvm, int n) +{ + int r; + struct kvm_vcpu *vcpu; + + r = -EINVAL; + if (n < 0 || n >= KVM_MAX_VCPUS) + goto out; + + vcpu = &kvm->vcpus[n]; + + mutex_lock(&vcpu->mutex); + + if (vcpu->vmcs) { + mutex_unlock(&vcpu->mutex); + return -EEXIST; + } + + vcpu->host_fx_image = (char*)ALIGN((hva_t)vcpu->fx_buf, + FX_IMAGE_ALIGN); + vcpu->guest_fx_image = vcpu->host_fx_image + FX_IMAGE_SIZE; + + vcpu->cpu = -1; /* First load will set up TR */ + vcpu->kvm = kvm; + r = kvm_arch_ops->vcpu_create(vcpu); + if (r < 0) + goto out_free_vcpus; + + kvm_arch_ops->vcpu_load(vcpu); + + r = kvm_arch_ops->vcpu_setup(vcpu); + if (r >= 0) + r = kvm_mmu_init(vcpu); + + vcpu_put(vcpu); + + if (r < 0) + goto out_free_vcpus; + + return 0; + +out_free_vcpus: + kvm_free_vcpu(vcpu); + mutex_unlock(&vcpu->mutex); +out: + return r; +} + +/* + * Allocate some memory and give it an address in the guest physical address + * space. + * + * Discontiguous memory is allowed, mostly for framebuffers. + */ +static int kvm_dev_ioctl_set_memory_region(struct kvm *kvm, + struct kvm_memory_region *mem) +{ + int r; + gfn_t base_gfn; + unsigned long npages; + unsigned long i; + struct kvm_memory_slot *memslot; + struct kvm_memory_slot old, new; + int memory_config_version; + + r = -EINVAL; + /* General sanity checks */ + if (mem->memory_size & (PAGE_SIZE - 1)) + goto out; + if (mem->guest_phys_addr & (PAGE_SIZE - 1)) + goto out; + if (mem->slot >= KVM_MEMORY_SLOTS) + goto out; + if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr) + goto out; + + memslot = &kvm->memslots[mem->slot]; + base_gfn = mem->guest_phys_addr >> PAGE_SHIFT; + npages = mem->memory_size >> PAGE_SHIFT; + + if (!npages) + mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES; + +raced: + spin_lock(&kvm->lock); + + memory_config_version = kvm->memory_config_version; + new = old = *memslot; + + new.base_gfn = base_gfn; + new.npages = npages; + new.flags = mem->flags; + + /* Disallow changing a memory slot's size. */ + r = -EINVAL; + if (npages && old.npages && npages != old.npages) + goto out_unlock; + + /* Check for overlaps */ + r = -EEXIST; + for (i = 0; i < KVM_MEMORY_SLOTS; ++i) { + struct kvm_memory_slot *s = &kvm->memslots[i]; + + if (s == memslot) + continue; + if (!((base_gfn + npages <= s->base_gfn) || + (base_gfn >= s->base_gfn + s->npages))) + goto out_unlock; + } + /* + * Do memory allocations outside lock. memory_config_version will + * detect any races. + */ + spin_unlock(&kvm->lock); + + /* Deallocate if slot is being removed */ + if (!npages) + new.phys_mem = 0; + + /* Free page dirty bitmap if unneeded */ + if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES)) + new.dirty_bitmap = 0; + + r = -ENOMEM; + + /* Allocate if a slot is being created */ + if (npages && !new.phys_mem) { + new.phys_mem = vmalloc(npages * sizeof(struct page *)); + + if (!new.phys_mem) + goto out_free; + + memset(new.phys_mem, 0, npages * sizeof(struct page *)); + for (i = 0; i < npages; ++i) { + new.phys_mem[i] = alloc_page(GFP_HIGHUSER); + if (!new.phys_mem[i]) + goto out_free; + } + } + + /* Allocate page dirty bitmap if needed */ + if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) { + unsigned dirty_bytes = ALIGN(npages, BITS_PER_LONG) / 8; + + new.dirty_bitmap = vmalloc(dirty_bytes); + if (!new.dirty_bitmap) + goto out_free; + memset(new.dirty_bitmap, 0, dirty_bytes); + } + + spin_lock(&kvm->lock); + + if (memory_config_version != kvm->memory_config_version) { + spin_unlock(&kvm->lock); + kvm_free_physmem_slot(&new, &old); + goto raced; + } + + r = -EAGAIN; + if (kvm->busy) + goto out_unlock; + + if (mem->slot >= kvm->nmemslots) + kvm->nmemslots = mem->slot + 1; + + *memslot = new; + ++kvm->memory_config_version; + + spin_unlock(&kvm->lock); + + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + struct kvm_vcpu *vcpu; + + vcpu = vcpu_load(kvm, i); + if (!vcpu) + continue; + kvm_mmu_reset_context(vcpu); + vcpu_put(vcpu); + } + + kvm_free_physmem_slot(&old, &new); + return 0; + +out_unlock: + spin_unlock(&kvm->lock); +out_free: + kvm_free_physmem_slot(&new, &old); +out: + return r; +} + +/* + * Get (and clear) the dirty memory log for a memory slot. + */ +static int kvm_dev_ioctl_get_dirty_log(struct kvm *kvm, + struct kvm_dirty_log *log) +{ + struct kvm_memory_slot *memslot; + int r, i; + int n; + unsigned long any = 0; + + spin_lock(&kvm->lock); + + /* + * Prevent changes to guest memory configuration even while the lock + * is not taken. + */ + ++kvm->busy; + spin_unlock(&kvm->lock); + r = -EINVAL; + if (log->slot >= KVM_MEMORY_SLOTS) + goto out; + + memslot = &kvm->memslots[log->slot]; + r = -ENOENT; + if (!memslot->dirty_bitmap) + goto out; + + n = ALIGN(memslot->npages, 8) / 8; + + for (i = 0; !any && i < n; ++i) + any = memslot->dirty_bitmap[i]; + + r = -EFAULT; + if (copy_to_user(log->dirty_bitmap, memslot->dirty_bitmap, n)) + goto out; + + + if (any) { + spin_lock(&kvm->lock); + kvm_mmu_slot_remove_write_access(kvm, log->slot); + spin_unlock(&kvm->lock); + memset(memslot->dirty_bitmap, 0, n); + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + struct kvm_vcpu *vcpu = vcpu_load(kvm, i); + + if (!vcpu) + continue; + kvm_arch_ops->tlb_flush(vcpu); + vcpu_put(vcpu); + } + } + + r = 0; + +out: + spin_lock(&kvm->lock); + --kvm->busy; + spin_unlock(&kvm->lock); + return r; +} + +struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn) +{ + int i; + + for (i = 0; i < kvm->nmemslots; ++i) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + + if (gfn >= memslot->base_gfn + && gfn < memslot->base_gfn + memslot->npages) + return memslot; + } + return 0; +} +EXPORT_SYMBOL_GPL(gfn_to_memslot); + +void mark_page_dirty(struct kvm *kvm, gfn_t gfn) +{ + int i; + struct kvm_memory_slot *memslot = 0; + unsigned long rel_gfn; + + for (i = 0; i < kvm->nmemslots; ++i) { + memslot = &kvm->memslots[i]; + + if (gfn >= memslot->base_gfn + && gfn < memslot->base_gfn + memslot->npages) { + + if (!memslot || !memslot->dirty_bitmap) + return; + + rel_gfn = gfn - memslot->base_gfn; + + /* avoid RMW */ + if (!test_bit(rel_gfn, memslot->dirty_bitmap)) + set_bit(rel_gfn, memslot->dirty_bitmap); + return; + } + } +} + +static int emulator_read_std(unsigned long addr, + unsigned long *val, + unsigned int bytes, + struct x86_emulate_ctxt *ctxt) +{ + struct kvm_vcpu *vcpu = ctxt->vcpu; + void *data = val; + + while (bytes) { + gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr); + unsigned offset = addr & (PAGE_SIZE-1); + unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset); + unsigned long pfn; + struct kvm_memory_slot *memslot; + void *page; + + if (gpa == UNMAPPED_GVA) + return X86EMUL_PROPAGATE_FAULT; + pfn = gpa >> PAGE_SHIFT; + memslot = gfn_to_memslot(vcpu->kvm, pfn); + if (!memslot) + return X86EMUL_UNHANDLEABLE; + page = kmap_atomic(gfn_to_page(memslot, pfn), KM_USER0); + + memcpy(data, page + offset, tocopy); + + kunmap_atomic(page, KM_USER0); + + bytes -= tocopy; + data += tocopy; + addr += tocopy; + } + + return X86EMUL_CONTINUE; +} + +static int emulator_write_std(unsigned long addr, + unsigned long val, + unsigned int bytes, + struct x86_emulate_ctxt *ctxt) +{ + printk(KERN_ERR "emulator_write_std: addr %lx n %d\n", + addr, bytes); + return X86EMUL_UNHANDLEABLE; +} + +static int emulator_read_emulated(unsigned long addr, + unsigned long *val, + unsigned int bytes, + struct x86_emulate_ctxt *ctxt) +{ + struct kvm_vcpu *vcpu = ctxt->vcpu; + + if (vcpu->mmio_read_completed) { + memcpy(val, vcpu->mmio_data, bytes); + vcpu->mmio_read_completed = 0; + return X86EMUL_CONTINUE; + } else if (emulator_read_std(addr, val, bytes, ctxt) + == X86EMUL_CONTINUE) + return X86EMUL_CONTINUE; + else { + gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr); + if (gpa == UNMAPPED_GVA) + return vcpu_printf(vcpu, "not present\n"), X86EMUL_PROPAGATE_FAULT; + vcpu->mmio_needed = 1; + vcpu->mmio_phys_addr = gpa; + vcpu->mmio_size = bytes; + vcpu->mmio_is_write = 0; + + return X86EMUL_UNHANDLEABLE; + } +} + +static int emulator_write_emulated(unsigned long addr, + unsigned long val, + unsigned int bytes, + struct x86_emulate_ctxt *ctxt) +{ + struct kvm_vcpu *vcpu = ctxt->vcpu; + gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr); + + if (gpa == UNMAPPED_GVA) + return X86EMUL_PROPAGATE_FAULT; + + vcpu->mmio_needed = 1; + vcpu->mmio_phys_addr = gpa; + vcpu->mmio_size = bytes; + vcpu->mmio_is_write = 1; + memcpy(vcpu->mmio_data, &val, bytes); + + return X86EMUL_CONTINUE; +} + +static int emulator_cmpxchg_emulated(unsigned long addr, + unsigned long old, + unsigned long new, + unsigned int bytes, + struct x86_emulate_ctxt *ctxt) +{ + static int reported; + + if (!reported) { + reported = 1; + printk(KERN_WARNING "kvm: emulating exchange as write\n"); + } + return emulator_write_emulated(addr, new, bytes, ctxt); +} + +static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg) +{ + return kvm_arch_ops->get_segment_base(vcpu, seg); +} + +int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address) +{ + spin_lock(&vcpu->kvm->lock); + vcpu->mmu.inval_page(vcpu, address); + spin_unlock(&vcpu->kvm->lock); + kvm_arch_ops->invlpg(vcpu, address); + return X86EMUL_CONTINUE; +} + +int emulate_clts(struct kvm_vcpu *vcpu) +{ + unsigned long cr0 = vcpu->cr0; + + cr0 &= ~CR0_TS_MASK; + kvm_arch_ops->set_cr0(vcpu, cr0); + return X86EMUL_CONTINUE; +} + +int emulator_get_dr(struct x86_emulate_ctxt* ctxt, int dr, unsigned long *dest) +{ + struct kvm_vcpu *vcpu = ctxt->vcpu; + + switch (dr) { + case 0 ... 3: + *dest = kvm_arch_ops->get_dr(vcpu, dr); + return X86EMUL_CONTINUE; + default: + printk(KERN_DEBUG "%s: unexpected dr %u\n", + __FUNCTION__, dr); + return X86EMUL_UNHANDLEABLE; + } +} + +int emulator_set_dr(struct x86_emulate_ctxt *ctxt, int dr, unsigned long value) +{ + unsigned long mask = (ctxt->mode == X86EMUL_MODE_PROT64) ? ~0ULL : ~0U; + int exception; + + kvm_arch_ops->set_dr(ctxt->vcpu, dr, value & mask, &exception); + if (exception) { + /* FIXME: better handling */ + return X86EMUL_UNHANDLEABLE; + } + return X86EMUL_CONTINUE; +} + +static void report_emulation_failure(struct x86_emulate_ctxt *ctxt) +{ + static int reported; + u8 opcodes[4]; + unsigned long rip = ctxt->vcpu->rip; + unsigned long rip_linear; + + rip_linear = rip + get_segment_base(ctxt->vcpu, VCPU_SREG_CS); + + if (reported) + return; + + emulator_read_std(rip_linear, (void *)opcodes, 4, ctxt); + + printk(KERN_ERR "emulation failed but !mmio_needed?" + " rip %lx %02x %02x %02x %02x\n", + rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]); + reported = 1; +} + +struct x86_emulate_ops emulate_ops = { + .read_std = emulator_read_std, + .write_std = emulator_write_std, + .read_emulated = emulator_read_emulated, + .write_emulated = emulator_write_emulated, + .cmpxchg_emulated = emulator_cmpxchg_emulated, +}; + +int emulate_instruction(struct kvm_vcpu *vcpu, + struct kvm_run *run, + unsigned long cr2, + u16 error_code) +{ + struct x86_emulate_ctxt emulate_ctxt; + int r; + int cs_db, cs_l; + + kvm_arch_ops->cache_regs(vcpu); + + kvm_arch_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + + emulate_ctxt.vcpu = vcpu; + emulate_ctxt.eflags = kvm_arch_ops->get_rflags(vcpu); + emulate_ctxt.cr2 = cr2; + emulate_ctxt.mode = (emulate_ctxt.eflags & X86_EFLAGS_VM) + ? X86EMUL_MODE_REAL : cs_l + ? X86EMUL_MODE_PROT64 : cs_db + ? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16; + + if (emulate_ctxt.mode == X86EMUL_MODE_PROT64) { + emulate_ctxt.cs_base = 0; + emulate_ctxt.ds_base = 0; + emulate_ctxt.es_base = 0; + emulate_ctxt.ss_base = 0; + } else { + emulate_ctxt.cs_base = get_segment_base(vcpu, VCPU_SREG_CS); + emulate_ctxt.ds_base = get_segment_base(vcpu, VCPU_SREG_DS); + emulate_ctxt.es_base = get_segment_base(vcpu, VCPU_SREG_ES); + emulate_ctxt.ss_base = get_segment_base(vcpu, VCPU_SREG_SS); + } + + emulate_ctxt.gs_base = get_segment_base(vcpu, VCPU_SREG_GS); + emulate_ctxt.fs_base = get_segment_base(vcpu, VCPU_SREG_FS); + + vcpu->mmio_is_write = 0; + r = x86_emulate_memop(&emulate_ctxt, &emulate_ops); + + if ((r || vcpu->mmio_is_write) && run) { + run->mmio.phys_addr = vcpu->mmio_phys_addr; + memcpy(run->mmio.data, vcpu->mmio_data, 8); + run->mmio.len = vcpu->mmio_size; + run->mmio.is_write = vcpu->mmio_is_write; + } + + if (r) { + if (!vcpu->mmio_needed) { + report_emulation_failure(&emulate_ctxt); + return EMULATE_FAIL; + } + return EMULATE_DO_MMIO; + } + + kvm_arch_ops->decache_regs(vcpu); + kvm_arch_ops->set_rflags(vcpu, emulate_ctxt.eflags); + + if (vcpu->mmio_is_write) + return EMULATE_DO_MMIO; + + return EMULATE_DONE; +} +EXPORT_SYMBOL_GPL(emulate_instruction); + +static u64 mk_cr_64(u64 curr_cr, u32 new_val) +{ + return (curr_cr & ~((1ULL << 32) - 1)) | new_val; +} + +void realmode_lgdt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) +{ + struct descriptor_table dt = { limit, base }; + + kvm_arch_ops->set_gdt(vcpu, &dt); +} + +void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base) +{ + struct descriptor_table dt = { limit, base }; + + kvm_arch_ops->set_idt(vcpu, &dt); +} + +void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw, + unsigned long *rflags) +{ + lmsw(vcpu, msw); + *rflags = kvm_arch_ops->get_rflags(vcpu); +} + +unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr) +{ + switch (cr) { + case 0: + return vcpu->cr0; + case 2: + return vcpu->cr2; + case 3: + return vcpu->cr3; + case 4: + return vcpu->cr4; + default: + vcpu_printf(vcpu, "%s: unexpected cr %u\n", __FUNCTION__, cr); + return 0; + } +} + +void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val, + unsigned long *rflags) +{ + switch (cr) { + case 0: + set_cr0(vcpu, mk_cr_64(vcpu->cr0, val)); + *rflags = kvm_arch_ops->get_rflags(vcpu); + break; + case 2: + vcpu->cr2 = val; + break; + case 3: + set_cr3(vcpu, val); + break; + case 4: + set_cr4(vcpu, mk_cr_64(vcpu->cr4, val)); + break; + default: + vcpu_printf(vcpu, "%s: unexpected cr %u\n", __FUNCTION__, cr); + } +} + +/* + * Reads an msr value (of 'msr_index') into 'pdata'. + * Returns 0 on success, non-0 otherwise. + * Assumes vcpu_load() was already called. + */ +static int get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) +{ + return kvm_arch_ops->get_msr(vcpu, msr_index, pdata); +} + +#ifdef __x86_64__ + +void set_efer(struct kvm_vcpu *vcpu, u64 efer) +{ + if (efer & EFER_RESERVED_BITS) { + printk(KERN_DEBUG "set_efer: 0x%llx #GP, reserved bits\n", + efer); + inject_gp(vcpu); + return; + } + + if (is_paging(vcpu) + && (vcpu->shadow_efer & EFER_LME) != (efer & EFER_LME)) { + printk(KERN_DEBUG "set_efer: #GP, change LME while paging\n"); + inject_gp(vcpu); + return; + } + + kvm_arch_ops->set_efer(vcpu, efer); + + efer &= ~EFER_LMA; + efer |= vcpu->shadow_efer & EFER_LMA; + + vcpu->shadow_efer = efer; +} +EXPORT_SYMBOL_GPL(set_efer); + +#endif + +/* + * Writes msr value into into the appropriate "register". + * Returns 0 on success, non-0 otherwise. + * Assumes vcpu_load() was already called. + */ +static int set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data) +{ + return kvm_arch_ops->set_msr(vcpu, msr_index, data); +} + +void kvm_resched(struct kvm_vcpu *vcpu) +{ + vcpu_put(vcpu); + cond_resched(); + /* Cannot fail - no vcpu unplug yet. */ + vcpu_load(vcpu->kvm, vcpu_slot(vcpu)); +} +EXPORT_SYMBOL_GPL(kvm_resched); + +void load_msrs(struct vmx_msr_entry *e, int n) +{ + int i; + + for (i = 0; i < n; ++i) + wrmsrl(e[i].index, e[i].data); +} +EXPORT_SYMBOL_GPL(load_msrs); + +void save_msrs(struct vmx_msr_entry *e, int n) +{ + int i; + + for (i = 0; i < n; ++i) + rdmsrl(e[i].index, e[i].data); +} +EXPORT_SYMBOL_GPL(save_msrs); + +static int kvm_dev_ioctl_run(struct kvm *kvm, struct kvm_run *kvm_run) +{ + struct kvm_vcpu *vcpu; + int r; + + if (kvm_run->vcpu < 0 || kvm_run->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + + vcpu = vcpu_load(kvm, kvm_run->vcpu); + if (!vcpu) + return -ENOENT; + + if (kvm_run->emulated) { + kvm_arch_ops->skip_emulated_instruction(vcpu); + kvm_run->emulated = 0; + } + + if (kvm_run->mmio_completed) { + memcpy(vcpu->mmio_data, kvm_run->mmio.data, 8); + vcpu->mmio_read_completed = 1; + } + + vcpu->mmio_needed = 0; + + r = kvm_arch_ops->run(vcpu, kvm_run); + + vcpu_put(vcpu); + return r; +} + +static int kvm_dev_ioctl_get_regs(struct kvm *kvm, struct kvm_regs *regs) +{ + struct kvm_vcpu *vcpu; + + if (regs->vcpu < 0 || regs->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + + vcpu = vcpu_load(kvm, regs->vcpu); + if (!vcpu) + return -ENOENT; + + kvm_arch_ops->cache_regs(vcpu); + + regs->rax = vcpu->regs[VCPU_REGS_RAX]; + regs->rbx = vcpu->regs[VCPU_REGS_RBX]; + regs->rcx = vcpu->regs[VCPU_REGS_RCX]; + regs->rdx = vcpu->regs[VCPU_REGS_RDX]; + regs->rsi = vcpu->regs[VCPU_REGS_RSI]; + regs->rdi = vcpu->regs[VCPU_REGS_RDI]; + regs->rsp = vcpu->regs[VCPU_REGS_RSP]; + regs->rbp = vcpu->regs[VCPU_REGS_RBP]; +#ifdef __x86_64__ + regs->r8 = vcpu->regs[VCPU_REGS_R8]; + regs->r9 = vcpu->regs[VCPU_REGS_R9]; + regs->r10 = vcpu->regs[VCPU_REGS_R10]; + regs->r11 = vcpu->regs[VCPU_REGS_R11]; + regs->r12 = vcpu->regs[VCPU_REGS_R12]; + regs->r13 = vcpu->regs[VCPU_REGS_R13]; + regs->r14 = vcpu->regs[VCPU_REGS_R14]; + regs->r15 = vcpu->regs[VCPU_REGS_R15]; +#endif + + regs->rip = vcpu->rip; + regs->rflags = kvm_arch_ops->get_rflags(vcpu); + + /* + * Don't leak debug flags in case they were set for guest debugging + */ + if (vcpu->guest_debug.enabled && vcpu->guest_debug.singlestep) + regs->rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF); + + vcpu_put(vcpu); + + return 0; +} + +static int kvm_dev_ioctl_set_regs(struct kvm *kvm, struct kvm_regs *regs) +{ + struct kvm_vcpu *vcpu; + + if (regs->vcpu < 0 || regs->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + + vcpu = vcpu_load(kvm, regs->vcpu); + if (!vcpu) + return -ENOENT; + + vcpu->regs[VCPU_REGS_RAX] = regs->rax; + vcpu->regs[VCPU_REGS_RBX] = regs->rbx; + vcpu->regs[VCPU_REGS_RCX] = regs->rcx; + vcpu->regs[VCPU_REGS_RDX] = regs->rdx; + vcpu->regs[VCPU_REGS_RSI] = regs->rsi; + vcpu->regs[VCPU_REGS_RDI] = regs->rdi; + vcpu->regs[VCPU_REGS_RSP] = regs->rsp; + vcpu->regs[VCPU_REGS_RBP] = regs->rbp; +#ifdef __x86_64__ + vcpu->regs[VCPU_REGS_R8] = regs->r8; + vcpu->regs[VCPU_REGS_R9] = regs->r9; + vcpu->regs[VCPU_REGS_R10] = regs->r10; + vcpu->regs[VCPU_REGS_R11] = regs->r11; + vcpu->regs[VCPU_REGS_R12] = regs->r12; + vcpu->regs[VCPU_REGS_R13] = regs->r13; + vcpu->regs[VCPU_REGS_R14] = regs->r14; + vcpu->regs[VCPU_REGS_R15] = regs->r15; +#endif + + vcpu->rip = regs->rip; + kvm_arch_ops->set_rflags(vcpu, regs->rflags); + + kvm_arch_ops->decache_regs(vcpu); + + vcpu_put(vcpu); + + return 0; +} + +static void get_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + return kvm_arch_ops->get_segment(vcpu, var, seg); +} + +static int kvm_dev_ioctl_get_sregs(struct kvm *kvm, struct kvm_sregs *sregs) +{ + struct kvm_vcpu *vcpu; + struct descriptor_table dt; + + if (sregs->vcpu < 0 || sregs->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + vcpu = vcpu_load(kvm, sregs->vcpu); + if (!vcpu) + return -ENOENT; + + get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); + get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); + get_segment(vcpu, &sregs->es, VCPU_SREG_ES); + get_segment(vcpu, &sregs->fs, VCPU_SREG_FS); + get_segment(vcpu, &sregs->gs, VCPU_SREG_GS); + get_segment(vcpu, &sregs->ss, VCPU_SREG_SS); + + get_segment(vcpu, &sregs->tr, VCPU_SREG_TR); + get_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR); + + kvm_arch_ops->get_idt(vcpu, &dt); + sregs->idt.limit = dt.limit; + sregs->idt.base = dt.base; + kvm_arch_ops->get_gdt(vcpu, &dt); + sregs->gdt.limit = dt.limit; + sregs->gdt.base = dt.base; + + sregs->cr0 = vcpu->cr0; + sregs->cr2 = vcpu->cr2; + sregs->cr3 = vcpu->cr3; + sregs->cr4 = vcpu->cr4; + sregs->cr8 = vcpu->cr8; + sregs->efer = vcpu->shadow_efer; + sregs->apic_base = vcpu->apic_base; + + memcpy(sregs->interrupt_bitmap, vcpu->irq_pending, + sizeof sregs->interrupt_bitmap); + + vcpu_put(vcpu); + + return 0; +} + +static void set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + return kvm_arch_ops->set_segment(vcpu, var, seg); +} + +static int kvm_dev_ioctl_set_sregs(struct kvm *kvm, struct kvm_sregs *sregs) +{ + struct kvm_vcpu *vcpu; + int mmu_reset_needed = 0; + int i; + struct descriptor_table dt; + + if (sregs->vcpu < 0 || sregs->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + vcpu = vcpu_load(kvm, sregs->vcpu); + if (!vcpu) + return -ENOENT; + + set_segment(vcpu, &sregs->cs, VCPU_SREG_CS); + set_segment(vcpu, &sregs->ds, VCPU_SREG_DS); + set_segment(vcpu, &sregs->es, VCPU_SREG_ES); + set_segment(vcpu, &sregs->fs, VCPU_SREG_FS); + set_segment(vcpu, &sregs->gs, VCPU_SREG_GS); + set_segment(vcpu, &sregs->ss, VCPU_SREG_SS); + + set_segment(vcpu, &sregs->tr, VCPU_SREG_TR); + set_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR); + + dt.limit = sregs->idt.limit; + dt.base = sregs->idt.base; + kvm_arch_ops->set_idt(vcpu, &dt); + dt.limit = sregs->gdt.limit; + dt.base = sregs->gdt.base; + kvm_arch_ops->set_gdt(vcpu, &dt); + + vcpu->cr2 = sregs->cr2; + mmu_reset_needed |= vcpu->cr3 != sregs->cr3; + vcpu->cr3 = sregs->cr3; + + vcpu->cr8 = sregs->cr8; + + mmu_reset_needed |= vcpu->shadow_efer != sregs->efer; +#ifdef __x86_64__ + kvm_arch_ops->set_efer(vcpu, sregs->efer); +#endif + vcpu->apic_base = sregs->apic_base; + + mmu_reset_needed |= vcpu->cr0 != sregs->cr0; + kvm_arch_ops->set_cr0_no_modeswitch(vcpu, sregs->cr0); + + mmu_reset_needed |= vcpu->cr4 != sregs->cr4; + kvm_arch_ops->set_cr4(vcpu, sregs->cr4); + + if (mmu_reset_needed) + kvm_mmu_reset_context(vcpu); + + memcpy(vcpu->irq_pending, sregs->interrupt_bitmap, + sizeof vcpu->irq_pending); + vcpu->irq_summary = 0; + for (i = 0; i < NR_IRQ_WORDS; ++i) + if (vcpu->irq_pending[i]) + __set_bit(i, &vcpu->irq_summary); + + vcpu_put(vcpu); + + return 0; +} + +/* + * List of msr numbers which we expose to userspace through KVM_GET_MSRS + * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST. + */ +static u32 msrs_to_save[] = { + MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, + MSR_K6_STAR, +#ifdef __x86_64__ + MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR, +#endif + MSR_IA32_TIME_STAMP_COUNTER, +}; + + +/* + * Adapt set_msr() to msr_io()'s calling convention + */ +static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data) +{ + return set_msr(vcpu, index, *data); +} + +/* + * Read or write a bunch of msrs. All parameters are kernel addresses. + * + * @return number of msrs set successfully. + */ +static int __msr_io(struct kvm *kvm, struct kvm_msrs *msrs, + struct kvm_msr_entry *entries, + int (*do_msr)(struct kvm_vcpu *vcpu, + unsigned index, u64 *data)) +{ + struct kvm_vcpu *vcpu; + int i; + + if (msrs->vcpu < 0 || msrs->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + + vcpu = vcpu_load(kvm, msrs->vcpu); + if (!vcpu) + return -ENOENT; + + for (i = 0; i < msrs->nmsrs; ++i) + if (do_msr(vcpu, entries[i].index, &entries[i].data)) + break; + + vcpu_put(vcpu); + + return i; +} + +/* + * Read or write a bunch of msrs. Parameters are user addresses. + * + * @return number of msrs set successfully. + */ +static int msr_io(struct kvm *kvm, struct kvm_msrs __user *user_msrs, + int (*do_msr)(struct kvm_vcpu *vcpu, + unsigned index, u64 *data), + int writeback) +{ + struct kvm_msrs msrs; + struct kvm_msr_entry *entries; + int r, n; + unsigned size; + + r = -EFAULT; + if (copy_from_user(&msrs, user_msrs, sizeof msrs)) + goto out; + + r = -E2BIG; + if (msrs.nmsrs >= MAX_IO_MSRS) + goto out; + + r = -ENOMEM; + size = sizeof(struct kvm_msr_entry) * msrs.nmsrs; + entries = vmalloc(size); + if (!entries) + goto out; + + r = -EFAULT; + if (copy_from_user(entries, user_msrs->entries, size)) + goto out_free; + + r = n = __msr_io(kvm, &msrs, entries, do_msr); + if (r < 0) + goto out_free; + + r = -EFAULT; + if (writeback && copy_to_user(user_msrs->entries, entries, size)) + goto out_free; + + r = n; + +out_free: + vfree(entries); +out: + return r; +} + +/* + * Translate a guest virtual address to a guest physical address. + */ +static int kvm_dev_ioctl_translate(struct kvm *kvm, struct kvm_translation *tr) +{ + unsigned long vaddr = tr->linear_address; + struct kvm_vcpu *vcpu; + gpa_t gpa; + + vcpu = vcpu_load(kvm, tr->vcpu); + if (!vcpu) + return -ENOENT; + spin_lock(&kvm->lock); + gpa = vcpu->mmu.gva_to_gpa(vcpu, vaddr); + tr->physical_address = gpa; + tr->valid = gpa != UNMAPPED_GVA; + tr->writeable = 1; + tr->usermode = 0; + spin_unlock(&kvm->lock); + vcpu_put(vcpu); + + return 0; +} + +static int kvm_dev_ioctl_interrupt(struct kvm *kvm, struct kvm_interrupt *irq) +{ + struct kvm_vcpu *vcpu; + + if (irq->vcpu < 0 || irq->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + if (irq->irq < 0 || irq->irq >= 256) + return -EINVAL; + vcpu = vcpu_load(kvm, irq->vcpu); + if (!vcpu) + return -ENOENT; + + set_bit(irq->irq, vcpu->irq_pending); + set_bit(irq->irq / BITS_PER_LONG, &vcpu->irq_summary); + + vcpu_put(vcpu); + + return 0; +} + +static int kvm_dev_ioctl_debug_guest(struct kvm *kvm, + struct kvm_debug_guest *dbg) +{ + struct kvm_vcpu *vcpu; + int r; + + if (dbg->vcpu < 0 || dbg->vcpu >= KVM_MAX_VCPUS) + return -EINVAL; + vcpu = vcpu_load(kvm, dbg->vcpu); + if (!vcpu) + return -ENOENT; + + r = kvm_arch_ops->set_guest_debug(vcpu, dbg); + + vcpu_put(vcpu); + + return r; +} + +static long kvm_dev_ioctl(struct file *filp, + unsigned int ioctl, unsigned long arg) +{ + struct kvm *kvm = filp->private_data; + int r = -EINVAL; + + switch (ioctl) { + case KVM_CREATE_VCPU: { + r = kvm_dev_ioctl_create_vcpu(kvm, arg); + if (r) + goto out; + break; + } + case KVM_RUN: { + struct kvm_run kvm_run; + + r = -EFAULT; + if (copy_from_user(&kvm_run, (void *)arg, sizeof kvm_run)) + goto out; + r = kvm_dev_ioctl_run(kvm, &kvm_run); + if (r < 0) + goto out; + r = -EFAULT; + if (copy_to_user((void *)arg, &kvm_run, sizeof kvm_run)) + goto out; + r = 0; + break; + } + case KVM_GET_REGS: { + struct kvm_regs kvm_regs; + + r = -EFAULT; + if (copy_from_user(&kvm_regs, (void *)arg, sizeof kvm_regs)) + goto out; + r = kvm_dev_ioctl_get_regs(kvm, &kvm_regs); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user((void *)arg, &kvm_regs, sizeof kvm_regs)) + goto out; + r = 0; + break; + } + case KVM_SET_REGS: { + struct kvm_regs kvm_regs; + + r = -EFAULT; + if (copy_from_user(&kvm_regs, (void *)arg, sizeof kvm_regs)) + goto out; + r = kvm_dev_ioctl_set_regs(kvm, &kvm_regs); + if (r) + goto out; + r = 0; + break; + } + case KVM_GET_SREGS: { + struct kvm_sregs kvm_sregs; + + r = -EFAULT; + if (copy_from_user(&kvm_sregs, (void *)arg, sizeof kvm_sregs)) + goto out; + r = kvm_dev_ioctl_get_sregs(kvm, &kvm_sregs); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user((void *)arg, &kvm_sregs, sizeof kvm_sregs)) + goto out; + r = 0; + break; + } + case KVM_SET_SREGS: { + struct kvm_sregs kvm_sregs; + + r = -EFAULT; + if (copy_from_user(&kvm_sregs, (void *)arg, sizeof kvm_sregs)) + goto out; + r = kvm_dev_ioctl_set_sregs(kvm, &kvm_sregs); + if (r) + goto out; + r = 0; + break; + } + case KVM_TRANSLATE: { + struct kvm_translation tr; + + r = -EFAULT; + if (copy_from_user(&tr, (void *)arg, sizeof tr)) + goto out; + r = kvm_dev_ioctl_translate(kvm, &tr); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user((void *)arg, &tr, sizeof tr)) + goto out; + r = 0; + break; + } + case KVM_INTERRUPT: { + struct kvm_interrupt irq; + + r = -EFAULT; + if (copy_from_user(&irq, (void *)arg, sizeof irq)) + goto out; + r = kvm_dev_ioctl_interrupt(kvm, &irq); + if (r) + goto out; + r = 0; + break; + } + case KVM_DEBUG_GUEST: { + struct kvm_debug_guest dbg; + + r = -EFAULT; + if (copy_from_user(&dbg, (void *)arg, sizeof dbg)) + goto out; + r = kvm_dev_ioctl_debug_guest(kvm, &dbg); + if (r) + goto out; + r = 0; + break; + } + case KVM_SET_MEMORY_REGION: { + struct kvm_memory_region kvm_mem; + + r = -EFAULT; + if (copy_from_user(&kvm_mem, (void *)arg, sizeof kvm_mem)) + goto out; + r = kvm_dev_ioctl_set_memory_region(kvm, &kvm_mem); + if (r) + goto out; + break; + } + case KVM_GET_DIRTY_LOG: { + struct kvm_dirty_log log; + + r = -EFAULT; + if (copy_from_user(&log, (void *)arg, sizeof log)) + goto out; + r = kvm_dev_ioctl_get_dirty_log(kvm, &log); + if (r) + goto out; + break; + } + case KVM_GET_MSRS: + r = msr_io(kvm, (void __user *)arg, get_msr, 1); + break; + case KVM_SET_MSRS: + r = msr_io(kvm, (void __user *)arg, do_set_msr, 0); + break; + case KVM_GET_MSR_INDEX_LIST: { + struct kvm_msr_list __user *user_msr_list = (void __user *)arg; + struct kvm_msr_list msr_list; + unsigned n; + + r = -EFAULT; + if (copy_from_user(&msr_list, user_msr_list, sizeof msr_list)) + goto out; + n = msr_list.nmsrs; + msr_list.nmsrs = ARRAY_SIZE(msrs_to_save); + if (copy_to_user(user_msr_list, &msr_list, sizeof msr_list)) + goto out; + r = -E2BIG; + if (n < ARRAY_SIZE(msrs_to_save)) + goto out; + r = -EFAULT; + if (copy_to_user(user_msr_list->indices, &msrs_to_save, + sizeof msrs_to_save)) + goto out; + r = 0; + } + default: + ; + } +out: + return r; +} + +static struct page *kvm_dev_nopage(struct vm_area_struct *vma, + unsigned long address, + int *type) +{ + struct kvm *kvm = vma->vm_file->private_data; + unsigned long pgoff; + struct kvm_memory_slot *slot; + struct page *page; + + *type = VM_FAULT_MINOR; + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + slot = gfn_to_memslot(kvm, pgoff); + if (!slot) + return NOPAGE_SIGBUS; + page = gfn_to_page(slot, pgoff); + if (!page) + return NOPAGE_SIGBUS; + get_page(page); + return page; +} + +static struct vm_operations_struct kvm_dev_vm_ops = { + .nopage = kvm_dev_nopage, +}; + +static int kvm_dev_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma->vm_ops = &kvm_dev_vm_ops; + return 0; +} + +static struct file_operations kvm_chardev_ops = { + .open = kvm_dev_open, + .release = kvm_dev_release, + .unlocked_ioctl = kvm_dev_ioctl, + .compat_ioctl = kvm_dev_ioctl, + .mmap = kvm_dev_mmap, +}; + +static struct miscdevice kvm_dev = { + MISC_DYNAMIC_MINOR, + "kvm", + &kvm_chardev_ops, +}; + +static int kvm_reboot(struct notifier_block *notifier, unsigned long val, + void *v) +{ + if (val == SYS_RESTART) { + /* + * Some (well, at least mine) BIOSes hang on reboot if + * in vmx root mode. + */ + printk(KERN_INFO "kvm: exiting hardware virtualization\n"); + on_each_cpu(kvm_arch_ops->hardware_disable, 0, 0, 1); + } + return NOTIFY_OK; +} + +static struct notifier_block kvm_reboot_notifier = { + .notifier_call = kvm_reboot, + .priority = 0, +}; + +static __init void kvm_init_debug(void) +{ + struct kvm_stats_debugfs_item *p; + + debugfs_dir = debugfs_create_dir("kvm", 0); + for (p = debugfs_entries; p->name; ++p) + p->dentry = debugfs_create_u32(p->name, 0444, debugfs_dir, + p->data); +} + +static void kvm_exit_debug(void) +{ + struct kvm_stats_debugfs_item *p; + + for (p = debugfs_entries; p->name; ++p) + debugfs_remove(p->dentry); + debugfs_remove(debugfs_dir); +} + +hpa_t bad_page_address; + +int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module) +{ + int r; + + kvm_arch_ops = ops; + + if (!kvm_arch_ops->cpu_has_kvm_support()) { + printk(KERN_ERR "kvm: no hardware support\n"); + return -EOPNOTSUPP; + } + if (kvm_arch_ops->disabled_by_bios()) { + printk(KERN_ERR "kvm: disabled by bios\n"); + return -EOPNOTSUPP; + } + + r = kvm_arch_ops->hardware_setup(); + if (r < 0) + return r; + + on_each_cpu(kvm_arch_ops->hardware_enable, 0, 0, 1); + register_reboot_notifier(&kvm_reboot_notifier); + + kvm_chardev_ops.owner = module; + + r = misc_register(&kvm_dev); + if (r) { + printk (KERN_ERR "kvm: misc device register failed\n"); + goto out_free; + } + + return r; + +out_free: + unregister_reboot_notifier(&kvm_reboot_notifier); + on_each_cpu(kvm_arch_ops->hardware_disable, 0, 0, 1); + kvm_arch_ops->hardware_unsetup(); + return r; +} + +void kvm_exit_arch(void) +{ + misc_deregister(&kvm_dev); + + unregister_reboot_notifier(&kvm_reboot_notifier); + on_each_cpu(kvm_arch_ops->hardware_disable, 0, 0, 1); + kvm_arch_ops->hardware_unsetup(); +} + +static __init int kvm_init(void) +{ + static struct page *bad_page; + int r = 0; + + kvm_init_debug(); + + if ((bad_page = alloc_page(GFP_KERNEL)) == NULL) { + r = -ENOMEM; + goto out; + } + + bad_page_address = page_to_pfn(bad_page) << PAGE_SHIFT; + memset(__va(bad_page_address), 0, PAGE_SIZE); + + return r; + +out: + kvm_exit_debug(); + return r; +} + +static __exit void kvm_exit(void) +{ + kvm_exit_debug(); + __free_page(pfn_to_page(bad_page_address >> PAGE_SHIFT)); +} + +module_init(kvm_init) +module_exit(kvm_exit) + +EXPORT_SYMBOL_GPL(kvm_init_arch); +EXPORT_SYMBOL_GPL(kvm_exit_arch); Index: linux/drivers/kvm/kvm_svm.h =================================================================== --- /dev/null +++ linux/drivers/kvm/kvm_svm.h @@ -0,0 +1,42 @@ +#ifndef __KVM_SVM_H +#define __KVM_SVM_H + +#include +#include +#include + +#include "svm.h" +#include "kvm.h" + +static const u32 host_save_msrs[] = { + MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, + MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, + MSR_IA32_DEBUGCTLMSR, /*MSR_IA32_LASTBRANCHFROMIP, + MSR_IA32_LASTBRANCHTOIP, MSR_IA32_LASTINTFROMIP,MSR_IA32_LASTINTTOIP,*/ + MSR_FS_BASE, MSR_GS_BASE, +}; + +#define NR_HOST_SAVE_MSRS (sizeof(host_save_msrs) / sizeof(*host_save_msrs)) +#define NUM_DB_REGS 4 + +struct vcpu_svm { + struct vmcb *vmcb; + unsigned long vmcb_pa; + struct svm_cpu_data *svm_data; + uint64_t asid_generation; + + unsigned long cr0; + unsigned long cr4; + unsigned long db_regs[NUM_DB_REGS]; + + u64 next_rip; + + u64 host_msrs[NR_HOST_SAVE_MSRS]; + unsigned long host_cr2; + unsigned long host_db_regs[NUM_DB_REGS]; + unsigned long host_dr6; + unsigned long host_dr7; +}; + +#endif + Index: linux/drivers/kvm/kvm_vmx.h =================================================================== --- /dev/null +++ linux/drivers/kvm/kvm_vmx.h @@ -0,0 +1,14 @@ +#ifndef __KVM_VMX_H +#define __KVM_VMX_H + +#ifdef __x86_64__ +/* + * avoid save/load MSR_SYSCALL_MASK and MSR_LSTAR by std vt + * mechanism (cpu bug AA24) + */ +#define NR_BAD_MSRS 2 +#else +#define NR_BAD_MSRS 0 +#endif + +#endif Index: linux/drivers/kvm/mmu.c =================================================================== --- /dev/null +++ linux/drivers/kvm/mmu.c @@ -0,0 +1,699 @@ +/* + * Kernel-based Virtual Machine driver for Linux + * + * This module enables machines with Intel VT-x extensions to run virtual + * machines without emulation or binary translation. + * + * MMU support + * + * Copyright (C) 2006 Qumranet, Inc. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ +#include +#include +#include +#include +#include +#include + +#include "vmx.h" +#include "kvm.h" + +#define pgprintk(x...) do { } while (0) + +#define ASSERT(x) \ + if (!(x)) { \ + printk(KERN_WARNING "assertion failed %s:%d: %s\n", \ + __FILE__, __LINE__, #x); \ + } + +#define PT64_ENT_PER_PAGE 512 +#define PT32_ENT_PER_PAGE 1024 + +#define PT_WRITABLE_SHIFT 1 + +#define PT_PRESENT_MASK (1ULL << 0) +#define PT_WRITABLE_MASK (1ULL << PT_WRITABLE_SHIFT) +#define PT_USER_MASK (1ULL << 2) +#define PT_PWT_MASK (1ULL << 3) +#define PT_PCD_MASK (1ULL << 4) +#define PT_ACCESSED_MASK (1ULL << 5) +#define PT_DIRTY_MASK (1ULL << 6) +#define PT_PAGE_SIZE_MASK (1ULL << 7) +#define PT_PAT_MASK (1ULL << 7) +#define PT_GLOBAL_MASK (1ULL << 8) +#define PT64_NX_MASK (1ULL << 63) + +#define PT_PAT_SHIFT 7 +#define PT_DIR_PAT_SHIFT 12 +#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT) + +#define PT32_DIR_PSE36_SIZE 4 +#define PT32_DIR_PSE36_SHIFT 13 +#define PT32_DIR_PSE36_MASK (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT) + + +#define PT32_PTE_COPY_MASK \ + (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK | \ + PT_ACCESSED_MASK | PT_DIRTY_MASK | PT_PAT_MASK | \ + PT_GLOBAL_MASK ) + +#define PT32_NON_PTE_COPY_MASK \ + (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK | \ + PT_ACCESSED_MASK | PT_DIRTY_MASK) + + +#define PT64_PTE_COPY_MASK \ + (PT64_NX_MASK | PT32_PTE_COPY_MASK) + +#define PT64_NON_PTE_COPY_MASK \ + (PT64_NX_MASK | PT32_NON_PTE_COPY_MASK) + + + +#define PT_FIRST_AVAIL_BITS_SHIFT 9 +#define PT64_SECOND_AVAIL_BITS_SHIFT 52 + +#define PT_SHADOW_PS_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) +#define PT_SHADOW_IO_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) + +#define PT_SHADOW_WRITABLE_SHIFT (PT_FIRST_AVAIL_BITS_SHIFT + 1) +#define PT_SHADOW_WRITABLE_MASK (1ULL << PT_SHADOW_WRITABLE_SHIFT) + +#define PT_SHADOW_USER_SHIFT (PT_SHADOW_WRITABLE_SHIFT + 1) +#define PT_SHADOW_USER_MASK (1ULL << (PT_SHADOW_USER_SHIFT)) + +#define PT_SHADOW_BITS_OFFSET (PT_SHADOW_WRITABLE_SHIFT - PT_WRITABLE_SHIFT) + +#define VALID_PAGE(x) ((x) != INVALID_PAGE) + +#define PT64_LEVEL_BITS 9 + +#define PT64_LEVEL_SHIFT(level) \ + ( PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS ) + +#define PT64_LEVEL_MASK(level) \ + (((1ULL << PT64_LEVEL_BITS) - 1) << PT64_LEVEL_SHIFT(level)) + +#define PT64_INDEX(address, level)\ + (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1)) + + +#define PT32_LEVEL_BITS 10 + +#define PT32_LEVEL_SHIFT(level) \ + ( PAGE_SHIFT + (level - 1) * PT32_LEVEL_BITS ) + +#define PT32_LEVEL_MASK(level) \ + (((1ULL << PT32_LEVEL_BITS) - 1) << PT32_LEVEL_SHIFT(level)) + +#define PT32_INDEX(address, level)\ + (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1)) + + +#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & PAGE_MASK) +#define PT64_DIR_BASE_ADDR_MASK \ + (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1)) + +#define PT32_BASE_ADDR_MASK PAGE_MASK +#define PT32_DIR_BASE_ADDR_MASK \ + (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1)) + + +#define PFERR_PRESENT_MASK (1U << 0) +#define PFERR_WRITE_MASK (1U << 1) +#define PFERR_USER_MASK (1U << 2) + +#define PT64_ROOT_LEVEL 4 +#define PT32_ROOT_LEVEL 2 +#define PT32E_ROOT_LEVEL 3 + +#define PT_DIRECTORY_LEVEL 2 +#define PT_PAGE_TABLE_LEVEL 1 + +static int is_write_protection(struct kvm_vcpu *vcpu) +{ + return vcpu->cr0 & CR0_WP_MASK; +} + +static int is_cpuid_PSE36(void) +{ + return 1; +} + +static int is_present_pte(unsigned long pte) +{ + return pte & PT_PRESENT_MASK; +} + +static int is_writeble_pte(unsigned long pte) +{ + return pte & PT_WRITABLE_MASK; +} + +static int is_io_pte(unsigned long pte) +{ + return pte & PT_SHADOW_IO_MARK; +} + +static void kvm_mmu_free_page(struct kvm_vcpu *vcpu, hpa_t page_hpa) +{ + struct kvm_mmu_page *page_head = page_header(page_hpa); + + list_del(&page_head->link); + page_head->page_hpa = page_hpa; + list_add(&page_head->link, &vcpu->free_pages); +} + +static int is_empty_shadow_page(hpa_t page_hpa) +{ + u32 *pos; + u32 *end; + for (pos = __va(page_hpa), end = pos + PAGE_SIZE / sizeof(u32); + pos != end; pos++) + if (*pos != 0) + return 0; + return 1; +} + +static hpa_t kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, u64 *parent_pte) +{ + struct kvm_mmu_page *page; + + if (list_empty(&vcpu->free_pages)) + return INVALID_PAGE; + + page = list_entry(vcpu->free_pages.next, struct kvm_mmu_page, link); + list_del(&page->link); + list_add(&page->link, &vcpu->kvm->active_mmu_pages); + ASSERT(is_empty_shadow_page(page->page_hpa)); + page->slot_bitmap = 0; + page->global = 1; + page->parent_pte = parent_pte; + return page->page_hpa; +} + +static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa) +{ + int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT)); + struct kvm_mmu_page *page_head = page_header(__pa(pte)); + + __set_bit(slot, &page_head->slot_bitmap); +} + +hpa_t safe_gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa) +{ + hpa_t hpa = gpa_to_hpa(vcpu, gpa); + + return is_error_hpa(hpa) ? bad_page_address | (gpa & ~PAGE_MASK): hpa; +} + +hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa) +{ + struct kvm_memory_slot *slot; + struct page *page; + + ASSERT((gpa & HPA_ERR_MASK) == 0); + slot = gfn_to_memslot(vcpu->kvm, gpa >> PAGE_SHIFT); + if (!slot) + return gpa | HPA_ERR_MASK; + page = gfn_to_page(slot, gpa >> PAGE_SHIFT); + return ((hpa_t)page_to_pfn(page) << PAGE_SHIFT) + | (gpa & (PAGE_SIZE-1)); +} + +hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva) +{ + gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, gva); + + if (gpa == UNMAPPED_GVA) + return UNMAPPED_GVA; + return gpa_to_hpa(vcpu, gpa); +} + + +static void release_pt_page_64(struct kvm_vcpu *vcpu, hpa_t page_hpa, + int level) +{ + ASSERT(vcpu); + ASSERT(VALID_PAGE(page_hpa)); + ASSERT(level <= PT64_ROOT_LEVEL && level > 0); + + if (level == 1) + memset(__va(page_hpa), 0, PAGE_SIZE); + else { + u64 *pos; + u64 *end; + + for (pos = __va(page_hpa), end = pos + PT64_ENT_PER_PAGE; + pos != end; pos++) { + u64 current_ent = *pos; + + *pos = 0; + if (is_present_pte(current_ent)) + release_pt_page_64(vcpu, + current_ent & + PT64_BASE_ADDR_MASK, + level - 1); + } + } + kvm_mmu_free_page(vcpu, page_hpa); +} + +static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) +{ +} + +static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, hpa_t p) +{ + int level = PT32E_ROOT_LEVEL; + hpa_t table_addr = vcpu->mmu.root_hpa; + + for (; ; level--) { + u32 index = PT64_INDEX(v, level); + u64 *table; + + ASSERT(VALID_PAGE(table_addr)); + table = __va(table_addr); + + if (level == 1) { + mark_page_dirty(vcpu->kvm, v >> PAGE_SHIFT); + page_header_update_slot(vcpu->kvm, table, v); + table[index] = p | PT_PRESENT_MASK | PT_WRITABLE_MASK | + PT_USER_MASK; + return 0; + } + + if (table[index] == 0) { + hpa_t new_table = kvm_mmu_alloc_page(vcpu, + &table[index]); + + if (!VALID_PAGE(new_table)) { + pgprintk("nonpaging_map: ENOMEM\n"); + return -ENOMEM; + } + + if (level == PT32E_ROOT_LEVEL) + table[index] = new_table | PT_PRESENT_MASK; + else + table[index] = new_table | PT_PRESENT_MASK | + PT_WRITABLE_MASK | PT_USER_MASK; + } + table_addr = table[index] & PT64_BASE_ADDR_MASK; + } +} + +static void nonpaging_flush(struct kvm_vcpu *vcpu) +{ + hpa_t root = vcpu->mmu.root_hpa; + + ++kvm_stat.tlb_flush; + pgprintk("nonpaging_flush\n"); + ASSERT(VALID_PAGE(root)); + release_pt_page_64(vcpu, root, vcpu->mmu.shadow_root_level); + root = kvm_mmu_alloc_page(vcpu, 0); + ASSERT(VALID_PAGE(root)); + vcpu->mmu.root_hpa = root; + if (is_paging(vcpu)) + root |= (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)); + kvm_arch_ops->set_cr3(vcpu, root); + kvm_arch_ops->tlb_flush(vcpu); +} + +static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr) +{ + return vaddr; +} + +static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, + u32 error_code) +{ + int ret; + gpa_t addr = gva; + + ASSERT(vcpu); + ASSERT(VALID_PAGE(vcpu->mmu.root_hpa)); + + for (;;) { + hpa_t paddr; + + paddr = gpa_to_hpa(vcpu , addr & PT64_BASE_ADDR_MASK); + + if (is_error_hpa(paddr)) + return 1; + + ret = nonpaging_map(vcpu, addr & PAGE_MASK, paddr); + if (ret) { + nonpaging_flush(vcpu); + continue; + } + break; + } + return ret; +} + +static void nonpaging_inval_page(struct kvm_vcpu *vcpu, gva_t addr) +{ +} + +static void nonpaging_free(struct kvm_vcpu *vcpu) +{ + hpa_t root; + + ASSERT(vcpu); + root = vcpu->mmu.root_hpa; + if (VALID_PAGE(root)) + release_pt_page_64(vcpu, root, vcpu->mmu.shadow_root_level); + vcpu->mmu.root_hpa = INVALID_PAGE; +} + +static int nonpaging_init_context(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu *context = &vcpu->mmu; + + context->new_cr3 = nonpaging_new_cr3; + context->page_fault = nonpaging_page_fault; + context->inval_page = nonpaging_inval_page; + context->gva_to_gpa = nonpaging_gva_to_gpa; + context->free = nonpaging_free; + context->root_level = PT32E_ROOT_LEVEL; + context->shadow_root_level = PT32E_ROOT_LEVEL; + context->root_hpa = kvm_mmu_alloc_page(vcpu, 0); + ASSERT(VALID_PAGE(context->root_hpa)); + kvm_arch_ops->set_cr3(vcpu, context->root_hpa); + return 0; +} + + +static void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu_page *page, *npage; + + list_for_each_entry_safe(page, npage, &vcpu->kvm->active_mmu_pages, + link) { + if (page->global) + continue; + + if (!page->parent_pte) + continue; + + *page->parent_pte = 0; + release_pt_page_64(vcpu, page->page_hpa, 1); + } + ++kvm_stat.tlb_flush; + kvm_arch_ops->tlb_flush(vcpu); +} + +static void paging_new_cr3(struct kvm_vcpu *vcpu) +{ + kvm_mmu_flush_tlb(vcpu); +} + +static void mark_pagetable_nonglobal(void *shadow_pte) +{ + page_header(__pa(shadow_pte))->global = 0; +} + +static inline void set_pte_common(struct kvm_vcpu *vcpu, + u64 *shadow_pte, + gpa_t gaddr, + int dirty, + u64 access_bits) +{ + hpa_t paddr; + + *shadow_pte |= access_bits << PT_SHADOW_BITS_OFFSET; + if (!dirty) + access_bits &= ~PT_WRITABLE_MASK; + + if (access_bits & PT_WRITABLE_MASK) + mark_page_dirty(vcpu->kvm, gaddr >> PAGE_SHIFT); + + *shadow_pte |= access_bits; + + paddr = gpa_to_hpa(vcpu, gaddr & PT64_BASE_ADDR_MASK); + + if (!(*shadow_pte & PT_GLOBAL_MASK)) + mark_pagetable_nonglobal(shadow_pte); + + if (is_error_hpa(paddr)) { + *shadow_pte |= gaddr; + *shadow_pte |= PT_SHADOW_IO_MARK; + *shadow_pte &= ~PT_PRESENT_MASK; + } else { + *shadow_pte |= paddr; + page_header_update_slot(vcpu->kvm, shadow_pte, gaddr); + } +} + +static void inject_page_fault(struct kvm_vcpu *vcpu, + u64 addr, + u32 err_code) +{ + kvm_arch_ops->inject_page_fault(vcpu, addr, err_code); +} + +static inline int fix_read_pf(u64 *shadow_ent) +{ + if ((*shadow_ent & PT_SHADOW_USER_MASK) && + !(*shadow_ent & PT_USER_MASK)) { + /* + * If supervisor write protect is disabled, we shadow kernel + * pages as user pages so we can trap the write access. + */ + *shadow_ent |= PT_USER_MASK; + *shadow_ent &= ~PT_WRITABLE_MASK; + + return 1; + + } + return 0; +} + +static int may_access(u64 pte, int write, int user) +{ + + if (user && !(pte & PT_USER_MASK)) + return 0; + if (write && !(pte & PT_WRITABLE_MASK)) + return 0; + return 1; +} + +/* + * Remove a shadow pte. + */ +static void paging_inval_page(struct kvm_vcpu *vcpu, gva_t addr) +{ + hpa_t page_addr = vcpu->mmu.root_hpa; + int level = vcpu->mmu.shadow_root_level; + + ++kvm_stat.invlpg; + + for (; ; level--) { + u32 index = PT64_INDEX(addr, level); + u64 *table = __va(page_addr); + + if (level == PT_PAGE_TABLE_LEVEL ) { + table[index] = 0; + return; + } + + if (!is_present_pte(table[index])) + return; + + page_addr = table[index] & PT64_BASE_ADDR_MASK; + + if (level == PT_DIRECTORY_LEVEL && + (table[index] & PT_SHADOW_PS_MARK)) { + table[index] = 0; + release_pt_page_64(vcpu, page_addr, PT_PAGE_TABLE_LEVEL); + + kvm_arch_ops->tlb_flush(vcpu); + return; + } + } +} + +static void paging_free(struct kvm_vcpu *vcpu) +{ + nonpaging_free(vcpu); +} + +#define PTTYPE 64 +#include "paging_tmpl.h" +#undef PTTYPE + +#define PTTYPE 32 +#include "paging_tmpl.h" +#undef PTTYPE + +static int paging64_init_context(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu *context = &vcpu->mmu; + + ASSERT(is_pae(vcpu)); + context->new_cr3 = paging_new_cr3; + context->page_fault = paging64_page_fault; + context->inval_page = paging_inval_page; + context->gva_to_gpa = paging64_gva_to_gpa; + context->free = paging_free; + context->root_level = PT64_ROOT_LEVEL; + context->shadow_root_level = PT64_ROOT_LEVEL; + context->root_hpa = kvm_mmu_alloc_page(vcpu, 0); + ASSERT(VALID_PAGE(context->root_hpa)); + kvm_arch_ops->set_cr3(vcpu, context->root_hpa | + (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK))); + return 0; +} + +static int paging32_init_context(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu *context = &vcpu->mmu; + + context->new_cr3 = paging_new_cr3; + context->page_fault = paging32_page_fault; + context->inval_page = paging_inval_page; + context->gva_to_gpa = paging32_gva_to_gpa; + context->free = paging_free; + context->root_level = PT32_ROOT_LEVEL; + context->shadow_root_level = PT32E_ROOT_LEVEL; + context->root_hpa = kvm_mmu_alloc_page(vcpu, 0); + ASSERT(VALID_PAGE(context->root_hpa)); + kvm_arch_ops->set_cr3(vcpu, context->root_hpa | + (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK))); + return 0; +} + +static int paging32E_init_context(struct kvm_vcpu *vcpu) +{ + int ret; + + if ((ret = paging64_init_context(vcpu))) + return ret; + + vcpu->mmu.root_level = PT32E_ROOT_LEVEL; + vcpu->mmu.shadow_root_level = PT32E_ROOT_LEVEL; + return 0; +} + +static int init_kvm_mmu(struct kvm_vcpu *vcpu) +{ + ASSERT(vcpu); + ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa)); + + if (!is_paging(vcpu)) + return nonpaging_init_context(vcpu); + else if (kvm_arch_ops->is_long_mode(vcpu)) + return paging64_init_context(vcpu); + else if (is_pae(vcpu)) + return paging32E_init_context(vcpu); + else + return paging32_init_context(vcpu); +} + +static void destroy_kvm_mmu(struct kvm_vcpu *vcpu) +{ + ASSERT(vcpu); + if (VALID_PAGE(vcpu->mmu.root_hpa)) { + vcpu->mmu.free(vcpu); + vcpu->mmu.root_hpa = INVALID_PAGE; + } +} + +int kvm_mmu_reset_context(struct kvm_vcpu *vcpu) +{ + destroy_kvm_mmu(vcpu); + return init_kvm_mmu(vcpu); +} + +static void free_mmu_pages(struct kvm_vcpu *vcpu) +{ + while (!list_empty(&vcpu->free_pages)) { + struct kvm_mmu_page *page; + + page = list_entry(vcpu->free_pages.next, + struct kvm_mmu_page, link); + list_del(&page->link); + __free_page(pfn_to_page(page->page_hpa >> PAGE_SHIFT)); + page->page_hpa = INVALID_PAGE; + } +} + +static int alloc_mmu_pages(struct kvm_vcpu *vcpu) +{ + int i; + + ASSERT(vcpu); + + for (i = 0; i < KVM_NUM_MMU_PAGES; i++) { + struct page *page; + struct kvm_mmu_page *page_header = &vcpu->page_header_buf[i]; + + INIT_LIST_HEAD(&page_header->link); + if ((page = alloc_page(GFP_KVM_MMU)) == NULL) + goto error_1; + page->private = (unsigned long)page_header; + page_header->page_hpa = (hpa_t)page_to_pfn(page) << PAGE_SHIFT; + memset(__va(page_header->page_hpa), 0, PAGE_SIZE); + list_add(&page_header->link, &vcpu->free_pages); + } + return 0; + +error_1: + free_mmu_pages(vcpu); + return -ENOMEM; +} + +int kvm_mmu_init(struct kvm_vcpu *vcpu) +{ + int r; + + ASSERT(vcpu); + ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa)); + ASSERT(list_empty(&vcpu->free_pages)); + + if ((r = alloc_mmu_pages(vcpu))) + return r; + + if ((r = init_kvm_mmu(vcpu))) { + free_mmu_pages(vcpu); + return r; + } + return 0; +} + +void kvm_mmu_destroy(struct kvm_vcpu *vcpu) +{ + ASSERT(vcpu); + + destroy_kvm_mmu(vcpu); + free_mmu_pages(vcpu); +} + +void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) +{ + struct kvm_mmu_page *page; + + list_for_each_entry(page, &kvm->active_mmu_pages, link) { + int i; + u64 *pt; + + if (!test_bit(slot, &page->slot_bitmap)) + continue; + + pt = __va(page->page_hpa); + for (i = 0; i < PT64_ENT_PER_PAGE; ++i) + /* avoid RMW */ + if (pt[i] & PT_WRITABLE_MASK) + pt[i] &= ~PT_WRITABLE_MASK; + + } +} Index: linux/drivers/kvm/paging_tmpl.h =================================================================== --- /dev/null +++ linux/drivers/kvm/paging_tmpl.h @@ -0,0 +1,397 @@ +/* + * Kernel-based Virtual Machine driver for Linux + * + * This module enables machines with Intel VT-x extensions to run virtual + * machines without emulation or binary translation. + * + * MMU support + * + * Copyright (C) 2006 Qumranet, Inc. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +/* + * We need the mmu code to access both 32-bit and 64-bit guest ptes, + * so the code in this file is compiled twice, once per pte size. + */ + +#if PTTYPE == 64 + #define pt_element_t u64 + #define guest_walker guest_walker64 + #define FNAME(name) paging##64_##name + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK + #define PT_DIR_BASE_ADDR_MASK PT64_DIR_BASE_ADDR_MASK + #define PT_INDEX(addr, level) PT64_INDEX(addr, level) + #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) + #define PT_LEVEL_MASK(level) PT64_LEVEL_MASK(level) + #define PT_PTE_COPY_MASK PT64_PTE_COPY_MASK + #define PT_NON_PTE_COPY_MASK PT64_NON_PTE_COPY_MASK +#elif PTTYPE == 32 + #define pt_element_t u32 + #define guest_walker guest_walker32 + #define FNAME(name) paging##32_##name + #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK + #define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK + #define PT_INDEX(addr, level) PT32_INDEX(addr, level) + #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) + #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level) + #define PT_PTE_COPY_MASK PT32_PTE_COPY_MASK + #define PT_NON_PTE_COPY_MASK PT32_NON_PTE_COPY_MASK +#else + #error Invalid PTTYPE value +#endif + +/* + * The guest_walker structure emulates the behavior of the hardware page + * table walker. + */ +struct guest_walker { + int level; + pt_element_t *table; + pt_element_t inherited_ar; +}; + +static void FNAME(init_walker)(struct guest_walker *walker, + struct kvm_vcpu *vcpu) +{ + hpa_t hpa; + struct kvm_memory_slot *slot; + + walker->level = vcpu->mmu.root_level; + slot = gfn_to_memslot(vcpu->kvm, + (vcpu->cr3 & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + hpa = safe_gpa_to_hpa(vcpu, vcpu->cr3 & PT64_BASE_ADDR_MASK); + walker->table = kmap_atomic(pfn_to_page(hpa >> PAGE_SHIFT), KM_USER0); + + ASSERT((!kvm_arch_ops->is_long_mode(vcpu) && is_pae(vcpu)) || + (vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) == 0); + + walker->table = (pt_element_t *)( (unsigned long)walker->table | + (unsigned long)(vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) ); + walker->inherited_ar = PT_USER_MASK | PT_WRITABLE_MASK; +} + +static void FNAME(release_walker)(struct guest_walker *walker) +{ + kunmap_atomic(walker->table, KM_USER0); +} + +static void FNAME(set_pte)(struct kvm_vcpu *vcpu, u64 guest_pte, + u64 *shadow_pte, u64 access_bits) +{ + ASSERT(*shadow_pte == 0); + access_bits &= guest_pte; + *shadow_pte = (guest_pte & PT_PTE_COPY_MASK); + set_pte_common(vcpu, shadow_pte, guest_pte & PT_BASE_ADDR_MASK, + guest_pte & PT_DIRTY_MASK, access_bits); +} + +static void FNAME(set_pde)(struct kvm_vcpu *vcpu, u64 guest_pde, + u64 *shadow_pte, u64 access_bits, + int index) +{ + gpa_t gaddr; + + ASSERT(*shadow_pte == 0); + access_bits &= guest_pde; + gaddr = (guest_pde & PT_DIR_BASE_ADDR_MASK) + PAGE_SIZE * index; + if (PTTYPE == 32 && is_cpuid_PSE36()) + gaddr |= (guest_pde & PT32_DIR_PSE36_MASK) << + (32 - PT32_DIR_PSE36_SHIFT); + *shadow_pte = (guest_pde & PT_NON_PTE_COPY_MASK) | + ((guest_pde & PT_DIR_PAT_MASK) >> + (PT_DIR_PAT_SHIFT - PT_PAT_SHIFT)); + set_pte_common(vcpu, shadow_pte, gaddr, + guest_pde & PT_DIRTY_MASK, access_bits); +} + +/* + * Fetch a guest pte from a specific level in the paging hierarchy. + */ +static pt_element_t *FNAME(fetch_guest)(struct kvm_vcpu *vcpu, + struct guest_walker *walker, + int level, + gva_t addr) +{ + + ASSERT(level > 0 && level <= walker->level); + + for (;;) { + int index = PT_INDEX(addr, walker->level); + hpa_t paddr; + + ASSERT(((unsigned long)walker->table & PAGE_MASK) == + ((unsigned long)&walker->table[index] & PAGE_MASK)); + if (level == walker->level || + !is_present_pte(walker->table[index]) || + (walker->level == PT_DIRECTORY_LEVEL && + (walker->table[index] & PT_PAGE_SIZE_MASK) && + (PTTYPE == 64 || is_pse(vcpu)))) + return &walker->table[index]; + if (walker->level != 3 || kvm_arch_ops->is_long_mode(vcpu)) + walker->inherited_ar &= walker->table[index]; + paddr = safe_gpa_to_hpa(vcpu, walker->table[index] & PT_BASE_ADDR_MASK); + kunmap_atomic(walker->table, KM_USER0); + walker->table = kmap_atomic(pfn_to_page(paddr >> PAGE_SHIFT), + KM_USER0); + --walker->level; + } +} + +/* + * Fetch a shadow pte for a specific level in the paging hierarchy. + */ +static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, + struct guest_walker *walker) +{ + hpa_t shadow_addr; + int level; + u64 *prev_shadow_ent = NULL; + + shadow_addr = vcpu->mmu.root_hpa; + level = vcpu->mmu.shadow_root_level; + + for (; ; level--) { + u32 index = SHADOW_PT_INDEX(addr, level); + u64 *shadow_ent = ((u64 *)__va(shadow_addr)) + index; + pt_element_t *guest_ent; + + if (is_present_pte(*shadow_ent) || is_io_pte(*shadow_ent)) { + if (level == PT_PAGE_TABLE_LEVEL) + return shadow_ent; + shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK; + prev_shadow_ent = shadow_ent; + continue; + } + + if (PTTYPE == 32 && level > PT32_ROOT_LEVEL) { + ASSERT(level == PT32E_ROOT_LEVEL); + guest_ent = FNAME(fetch_guest)(vcpu, walker, + PT32_ROOT_LEVEL, addr); + } else + guest_ent = FNAME(fetch_guest)(vcpu, walker, + level, addr); + + if (!is_present_pte(*guest_ent)) + return NULL; + + /* Don't set accessed bit on PAE PDPTRs */ + if (vcpu->mmu.root_level != 3 || walker->level != 3) + *guest_ent |= PT_ACCESSED_MASK; + + if (level == PT_PAGE_TABLE_LEVEL) { + + if (walker->level == PT_DIRECTORY_LEVEL) { + if (prev_shadow_ent) + *prev_shadow_ent |= PT_SHADOW_PS_MARK; + FNAME(set_pde)(vcpu, *guest_ent, shadow_ent, + walker->inherited_ar, + PT_INDEX(addr, PT_PAGE_TABLE_LEVEL)); + } else { + ASSERT(walker->level == PT_PAGE_TABLE_LEVEL); + FNAME(set_pte)(vcpu, *guest_ent, shadow_ent, walker->inherited_ar); + } + return shadow_ent; + } + + shadow_addr = kvm_mmu_alloc_page(vcpu, shadow_ent); + if (!VALID_PAGE(shadow_addr)) + return ERR_PTR(-ENOMEM); + if (!kvm_arch_ops->is_long_mode(vcpu) && level == 3) + *shadow_ent = shadow_addr | + (*guest_ent & (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK)); + else { + *shadow_ent = shadow_addr | + (*guest_ent & PT_NON_PTE_COPY_MASK); + *shadow_ent |= (PT_WRITABLE_MASK | PT_USER_MASK); + } + prev_shadow_ent = shadow_ent; + } +} + +/* + * The guest faulted for write. We need to + * + * - check write permissions + * - update the guest pte dirty bit + * - update our own dirty page tracking structures + */ +static int FNAME(fix_write_pf)(struct kvm_vcpu *vcpu, + u64 *shadow_ent, + struct guest_walker *walker, + gva_t addr, + int user) +{ + pt_element_t *guest_ent; + int writable_shadow; + gfn_t gfn; + + if (is_writeble_pte(*shadow_ent)) + return 0; + + writable_shadow = *shadow_ent & PT_SHADOW_WRITABLE_MASK; + if (user) { + /* + * User mode access. Fail if it's a kernel page or a read-only + * page. + */ + if (!(*shadow_ent & PT_SHADOW_USER_MASK) || !writable_shadow) + return 0; + ASSERT(*shadow_ent & PT_USER_MASK); + } else + /* + * Kernel mode access. Fail if it's a read-only page and + * supervisor write protection is enabled. + */ + if (!writable_shadow) { + if (is_write_protection(vcpu)) + return 0; + *shadow_ent &= ~PT_USER_MASK; + } + + guest_ent = FNAME(fetch_guest)(vcpu, walker, PT_PAGE_TABLE_LEVEL, addr); + + if (!is_present_pte(*guest_ent)) { + *shadow_ent = 0; + return 0; + } + + gfn = (*guest_ent & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; + mark_page_dirty(vcpu->kvm, gfn); + *shadow_ent |= PT_WRITABLE_MASK; + *guest_ent |= PT_DIRTY_MASK; + + return 1; +} + +/* + * Page fault handler. There are several causes for a page fault: + * - there is no shadow pte for the guest pte + * - write access through a shadow pte marked read only so that we can set + * the dirty bit + * - write access to a shadow pte marked read only so we can update the page + * dirty bitmap, when userspace requests it + * - mmio access; in this case we will never install a present shadow pte + * - normal guest page fault due to the guest pte marked not present, not + * writable, or not executable + * + * Returns: 1 if we need to emulate the instruction, 0 otherwise + */ +static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, + u32 error_code) +{ + int write_fault = error_code & PFERR_WRITE_MASK; + int pte_present = error_code & PFERR_PRESENT_MASK; + int user_fault = error_code & PFERR_USER_MASK; + struct guest_walker walker; + u64 *shadow_pte; + int fixed; + + /* + * Look up the shadow pte for the faulting address. + */ + for (;;) { + FNAME(init_walker)(&walker, vcpu); + shadow_pte = FNAME(fetch)(vcpu, addr, &walker); + if (IS_ERR(shadow_pte)) { /* must be -ENOMEM */ + nonpaging_flush(vcpu); + FNAME(release_walker)(&walker); + continue; + } + break; + } + + /* + * The page is not mapped by the guest. Let the guest handle it. + */ + if (!shadow_pte) { + inject_page_fault(vcpu, addr, error_code); + FNAME(release_walker)(&walker); + return 0; + } + + /* + * Update the shadow pte. + */ + if (write_fault) + fixed = FNAME(fix_write_pf)(vcpu, shadow_pte, &walker, addr, + user_fault); + else + fixed = fix_read_pf(shadow_pte); + + FNAME(release_walker)(&walker); + + /* + * mmio: emulate if accessible, otherwise its a guest fault. + */ + if (is_io_pte(*shadow_pte)) { + if (may_access(*shadow_pte, write_fault, user_fault)) + return 1; + pgprintk("%s: io work, no access\n", __FUNCTION__); + inject_page_fault(vcpu, addr, + error_code | PFERR_PRESENT_MASK); + return 0; + } + + /* + * pte not present, guest page fault. + */ + if (pte_present && !fixed) { + inject_page_fault(vcpu, addr, error_code); + return 0; + } + + ++kvm_stat.pf_fixed; + + return 0; +} + +static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr) +{ + struct guest_walker walker; + pt_element_t guest_pte; + gpa_t gpa; + + FNAME(init_walker)(&walker, vcpu); + guest_pte = *FNAME(fetch_guest)(vcpu, &walker, PT_PAGE_TABLE_LEVEL, + vaddr); + FNAME(release_walker)(&walker); + + if (!is_present_pte(guest_pte)) + return UNMAPPED_GVA; + + if (walker.level == PT_DIRECTORY_LEVEL) { + ASSERT((guest_pte & PT_PAGE_SIZE_MASK)); + ASSERT(PTTYPE == 64 || is_pse(vcpu)); + + gpa = (guest_pte & PT_DIR_BASE_ADDR_MASK) | (vaddr & + (PT_LEVEL_MASK(PT_PAGE_TABLE_LEVEL) | ~PAGE_MASK)); + + if (PTTYPE == 32 && is_cpuid_PSE36()) + gpa |= (guest_pte & PT32_DIR_PSE36_MASK) << + (32 - PT32_DIR_PSE36_SHIFT); + } else { + gpa = (guest_pte & PT_BASE_ADDR_MASK); + gpa |= (vaddr & ~PAGE_MASK); + } + + return gpa; +} + +#undef pt_element_t +#undef guest_walker +#undef FNAME +#undef PT_BASE_ADDR_MASK +#undef PT_INDEX +#undef SHADOW_PT_INDEX +#undef PT_LEVEL_MASK +#undef PT_PTE_COPY_MASK +#undef PT_NON_PTE_COPY_MASK +#undef PT_DIR_BASE_ADDR_MASK Index: linux/drivers/kvm/segment_descriptor.h =================================================================== --- /dev/null +++ linux/drivers/kvm/segment_descriptor.h @@ -0,0 +1,17 @@ +struct segment_descriptor { + u16 limit_low; + u16 base_low; + u8 base_mid; + u8 type : 4; + u8 system : 1; + u8 dpl : 2; + u8 present : 1; + u8 limit_high : 4; + u8 avl : 1; + u8 long_mode : 1; + u8 default_op : 1; + u8 granularity : 1; + u8 base_high; +} __attribute__((packed)); + + Index: linux/drivers/kvm/svm.c =================================================================== --- /dev/null +++ linux/drivers/kvm/svm.c @@ -0,0 +1,1627 @@ +/* + * Kernel-based Virtual Machine driver for Linux + * + * AMD SVM support + * + * Copyright (C) 2006 Qumranet, Inc. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include +#include +#include +#include + +#include "kvm_svm.h" +#include "x86_emulate.h" + +MODULE_AUTHOR("Qumranet"); +MODULE_LICENSE("GPL"); + +#define IOPM_ALLOC_ORDER 2 +#define MSRPM_ALLOC_ORDER 1 + +#define DB_VECTOR 1 +#define UD_VECTOR 6 +#define GP_VECTOR 13 + +#define DR7_GD_MASK (1 << 13) +#define DR6_BD_MASK (1 << 13) +#define CR4_DE_MASK (1UL << 3) + +#define SEG_TYPE_LDT 2 +#define SEG_TYPE_BUSY_TSS16 3 + +unsigned long iopm_base; +unsigned long msrpm_base; + +struct svm_cpu_data { + int cpu; + + uint64_t asid_generation; + uint32_t max_asid; + uint32_t next_asid; + struct ldttss_desc *tss_desc; + + struct page *save_area; +}; + +static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data); + +struct svm_init_data { + int cpu; + int r; +}; + +static u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; + +#define NUM_MSR_MAPS (sizeof(msrpm_ranges) / sizeof(*msrpm_ranges)) +#define MSRS_RANGE_SIZE 2048 +#define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) + +#define MAX_INST_SIZE 15 + +static unsigned get_addr_size(struct kvm_vcpu *vcpu) +{ + struct vmcb_save_area *sa = &vcpu->svm->vmcb->save; + u16 cs_attrib; + + if (!(sa->cr0 & CR0_PE_MASK) || (sa->rflags & X86_EFLAGS_VM)) + return 2; + + cs_attrib = sa->cs.attrib; + + return (cs_attrib & SVM_SELECTOR_L_MASK) ? 8 : + (cs_attrib & SVM_SELECTOR_DB_MASK) ? 4 : 2; +} + +static inline u8 pop_irq(struct kvm_vcpu *vcpu) +{ + int word_index = __ffs(vcpu->irq_summary); + int bit_index = __ffs(vcpu->irq_pending[word_index]); + int irq = word_index * BITS_PER_LONG + bit_index; + + clear_bit(bit_index, &vcpu->irq_pending[word_index]); + if (!vcpu->irq_pending[word_index]) + clear_bit(word_index, &vcpu->irq_summary); + return irq; +} + +static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq) +{ + set_bit(irq, vcpu->irq_pending); + set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary); +} + +static inline void clgi(void) +{ + asm volatile (SVM_CLGI); +} + +static inline void stgi(void) +{ + asm volatile (SVM_STGI); +} + +static inline void invlpga(unsigned long addr, u32 asid) +{ + asm volatile (SVM_INVLPGA :: "a"(addr), "c"(asid)); +} + +static inline unsigned long kvm_read_cr2(void) +{ + unsigned long cr2; + + asm volatile ("mov %%cr2, %0" : "=r" (cr2)); + return cr2; +} + +static inline void kvm_write_cr2(unsigned long val) +{ + asm volatile ("mov %0, %%cr2" :: "r" (val)); +} + +static inline unsigned long read_dr6(void) +{ + unsigned long dr6; + + asm volatile ("mov %%dr6, %0" : "=r" (dr6)); + return dr6; +} + +static inline void write_dr6(unsigned long val) +{ + asm volatile ("mov %0, %%dr6" :: "r" (val)); +} + +static inline unsigned long read_dr7(void) +{ + unsigned long dr7; + + asm volatile ("mov %%dr7, %0" : "=r" (dr7)); + return dr7; +} + +static inline void write_dr7(unsigned long val) +{ + asm volatile ("mov %0, %%dr7" :: "r" (val)); +} + +static inline int svm_is_long_mode(struct kvm_vcpu *vcpu) +{ + return vcpu->svm->vmcb->save.efer & EFER_LMA; +} + +static inline void force_new_asid(struct kvm_vcpu *vcpu) +{ + vcpu->svm->asid_generation--; +} + +static inline void flush_guest_tlb(struct kvm_vcpu *vcpu) +{ + force_new_asid(vcpu); +} + +static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) +{ + if (!(efer & EFER_LMA)) + efer &= ~EFER_LME; + + vcpu->svm->vmcb->save.efer = efer | MSR_EFER_SVME_MASK; + vcpu->shadow_efer = efer; +} + +static void svm_inject_gp(struct kvm_vcpu *vcpu, unsigned error_code) +{ + vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_VALID_ERR | + SVM_EVTINJ_TYPE_EXEPT | + GP_VECTOR; + vcpu->svm->vmcb->control.event_inj_err = error_code; +} + +static void inject_ud(struct kvm_vcpu *vcpu) +{ + vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_TYPE_EXEPT | + UD_VECTOR; +} + +static void inject_db(struct kvm_vcpu *vcpu) +{ + vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_TYPE_EXEPT | + DB_VECTOR; +} + +static int is_page_fault(uint32_t info) +{ + info &= SVM_EVTINJ_VEC_MASK | SVM_EVTINJ_TYPE_MASK | SVM_EVTINJ_VALID; + return info == (PF_VECTOR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT); +} + +static int is_external_interrupt(u32 info) +{ + info &= SVM_EVTINJ_TYPE_MASK | SVM_EVTINJ_VALID; + return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR); +} + +static void skip_emulated_instruction(struct kvm_vcpu *vcpu) +{ + if (!vcpu->svm->next_rip) { + printk(KERN_DEBUG "%s: NOP\n", __FUNCTION__); + return; + } + if (vcpu->svm->next_rip - vcpu->svm->vmcb->save.rip > 15) { + printk(KERN_ERR "%s: ip 0x%llx next 0x%llx\n", + __FUNCTION__, + vcpu->svm->vmcb->save.rip, + vcpu->svm->next_rip); + } + + vcpu->rip = vcpu->svm->vmcb->save.rip = vcpu->svm->next_rip; + vcpu->svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK; +} + +static int has_svm(void) +{ + uint32_t eax, ebx, ecx, edx; + + if (current_cpu_data.x86_vendor != X86_VENDOR_AMD) { + printk(KERN_INFO "has_svm: not amd\n"); + return 0; + } + + cpuid(0x80000000, &eax, &ebx, &ecx, &edx); + if (eax < SVM_CPUID_FUNC) { + printk(KERN_INFO "has_svm: can't execute cpuid_8000000a\n"); + return 0; + } + + cpuid(0x80000001, &eax, &ebx, &ecx, &edx); + if (!(ecx & (1 << SVM_CPUID_FEATURE_SHIFT))) { + printk(KERN_DEBUG "has_svm: svm not available\n"); + return 0; + } + return 1; +} + +static void svm_hardware_disable(void *garbage) +{ + struct svm_cpu_data *svm_data + = per_cpu(svm_data, raw_smp_processor_id()); + + if (svm_data) { + uint64_t efer; + + wrmsrl(MSR_VM_HSAVE_PA, 0); + rdmsrl(MSR_EFER, efer); + wrmsrl(MSR_EFER, efer & ~MSR_EFER_SVME_MASK); + per_cpu(svm_data, raw_smp_processor_id()) = 0; + __free_page(svm_data->save_area); + kfree(svm_data); + } +} + +static void svm_hardware_enable(void *garbage) +{ + + struct svm_cpu_data *svm_data; + uint64_t efer; + struct desc_ptr gdt_descr; + struct desc_struct *gdt; + int me = raw_smp_processor_id(); + + if (!has_svm()) { + printk(KERN_ERR "svm_cpu_init: err EOPNOTSUPP on %d\n", me); + return; + } + svm_data = per_cpu(svm_data, me); + + if (!svm_data) { + printk(KERN_ERR "svm_cpu_init: svm_data is NULL on %d\n", + me); + return; + } + + svm_data->asid_generation = 1; + svm_data->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1; + svm_data->next_asid = svm_data->max_asid + 1; + + asm volatile ( "sgdt %0" : "=m"(gdt_descr) ); + gdt = (struct desc_struct *)gdt_descr.address; + svm_data->tss_desc = (struct ldttss_desc *)(gdt + GDT_ENTRY_TSS); + + rdmsrl(MSR_EFER, efer); + wrmsrl(MSR_EFER, efer | MSR_EFER_SVME_MASK); + + wrmsrl(MSR_VM_HSAVE_PA, + page_to_pfn(svm_data->save_area) << PAGE_SHIFT); +} + +static int svm_cpu_init(int cpu) +{ + struct svm_cpu_data *svm_data; + int r; + + svm_data = kzalloc(sizeof(struct svm_cpu_data), GFP_KERNEL); + if (!svm_data) + return -ENOMEM; + svm_data->cpu = cpu; + svm_data->save_area = alloc_page(GFP_KERNEL); + r = -ENOMEM; + if (!svm_data->save_area) + goto err_1; + + per_cpu(svm_data, cpu) = svm_data; + + return 0; + +err_1: + kfree(svm_data); + return r; + +} + +static int set_msr_interception(u32 *msrpm, unsigned msr, + int read, int write) +{ + int i; + + for (i = 0; i < NUM_MSR_MAPS; i++) { + if (msr >= msrpm_ranges[i] && + msr < msrpm_ranges[i] + MSRS_IN_RANGE) { + u32 msr_offset = (i * MSRS_IN_RANGE + msr - + msrpm_ranges[i]) * 2; + + u32 *base = msrpm + (msr_offset / 32); + u32 msr_shift = msr_offset % 32; + u32 mask = ((write) ? 0 : 2) | ((read) ? 0 : 1); + *base = (*base & ~(0x3 << msr_shift)) | + (mask << msr_shift); + return 1; + } + } + printk(KERN_DEBUG "%s: not found 0x%x\n", __FUNCTION__, msr); + return 0; +} + +static __init int svm_hardware_setup(void) +{ + int cpu; + struct page *iopm_pages; + struct page *msrpm_pages; + void *msrpm_va; + int r; + + + iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER); + + if (!iopm_pages) + return -ENOMEM; + memset(page_address(iopm_pages), 0xff, + PAGE_SIZE * (1 << IOPM_ALLOC_ORDER)); + iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT; + + + msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER); + + r = -ENOMEM; + if (!msrpm_pages) + goto err_1; + + msrpm_va = page_address(msrpm_pages); + memset(msrpm_va, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER)); + msrpm_base = page_to_pfn(msrpm_pages) << PAGE_SHIFT; + + set_msr_interception(msrpm_va, MSR_GS_BASE, 1, 1); + set_msr_interception(msrpm_va, MSR_FS_BASE, 1, 1); + set_msr_interception(msrpm_va, MSR_KERNEL_GS_BASE, 1, 1); + set_msr_interception(msrpm_va, MSR_STAR, 1, 1); + set_msr_interception(msrpm_va, MSR_LSTAR, 1, 1); + set_msr_interception(msrpm_va, MSR_CSTAR, 1, 1); + set_msr_interception(msrpm_va, MSR_SYSCALL_MASK, 1, 1); + set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_CS, 1, 1); + set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_ESP, 1, 1); + set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_EIP, 1, 1); + + for_each_online_cpu(cpu) { + r = svm_cpu_init(cpu); + if (r) + goto err_2; + } + return 0; + +err_2: + __free_pages(msrpm_pages, MSRPM_ALLOC_ORDER); + msrpm_base = 0; +err_1: + __free_pages(iopm_pages, IOPM_ALLOC_ORDER); + iopm_base = 0; + return r; +} + +static __exit void svm_hardware_unsetup(void) +{ + __free_pages(pfn_to_page(msrpm_base >> PAGE_SHIFT), MSRPM_ALLOC_ORDER); + __free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER); + iopm_base = msrpm_base = 0; +} + +static void init_seg(struct vmcb_seg *seg) +{ + seg->selector = 0; + seg->attrib = SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK | + SVM_SELECTOR_WRITE_MASK; /* Read/Write Data Segment */ + seg->limit = 0xffff; + seg->base = 0; +} + +static void init_sys_seg(struct vmcb_seg *seg, uint32_t type) +{ + seg->selector = 0; + seg->attrib = SVM_SELECTOR_P_MASK | type; + seg->limit = 0xffff; + seg->base = 0; +} + +static int svm_vcpu_setup(struct kvm_vcpu *vcpu) +{ + return 0; +} + +static void init_vmcb(struct vmcb *vmcb) +{ + struct vmcb_control_area *control = &vmcb->control; + struct vmcb_save_area *save = &vmcb->save; + u64 tsc; + + control->intercept_cr_read = INTERCEPT_CR0_MASK | + INTERCEPT_CR3_MASK | + INTERCEPT_CR4_MASK; + + control->intercept_cr_write = INTERCEPT_CR0_MASK | + INTERCEPT_CR3_MASK | + INTERCEPT_CR4_MASK; + + control->intercept_dr_read = INTERCEPT_DR0_MASK | + INTERCEPT_DR1_MASK | + INTERCEPT_DR2_MASK | + INTERCEPT_DR3_MASK; + + control->intercept_dr_write = INTERCEPT_DR0_MASK | + INTERCEPT_DR1_MASK | + INTERCEPT_DR2_MASK | + INTERCEPT_DR3_MASK | + INTERCEPT_DR5_MASK | + INTERCEPT_DR7_MASK; + + control->intercept_exceptions = 1 << PF_VECTOR; + + + control->intercept = (1ULL << INTERCEPT_INTR) | + (1ULL << INTERCEPT_NMI) | + /* + * selective cr0 intercept bug? + * 0: 0f 22 d8 mov %eax,%cr3 + * 3: 0f 20 c0 mov %cr0,%eax + * 6: 0d 00 00 00 80 or $0x80000000,%eax + * b: 0f 22 c0 mov %eax,%cr0 + * set cr3 ->interception + * get cr0 ->interception + * set cr0 -> no interception + */ + /* (1ULL << INTERCEPT_SELECTIVE_CR0) | */ + (1ULL << INTERCEPT_CPUID) | + (1ULL << INTERCEPT_HLT) | + (1ULL << INTERCEPT_INVLPG) | + (1ULL << INTERCEPT_INVLPGA) | + (1ULL << INTERCEPT_IOIO_PROT) | + (1ULL << INTERCEPT_MSR_PROT) | + (1ULL << INTERCEPT_TASK_SWITCH) | + (1ULL << INTERCEPT_VMRUN) | + (1ULL << INTERCEPT_VMMCALL) | + (1ULL << INTERCEPT_VMLOAD) | + (1ULL << INTERCEPT_VMSAVE) | + (1ULL << INTERCEPT_STGI) | + (1ULL << INTERCEPT_CLGI) | + (1ULL << INTERCEPT_SKINIT); + + control->iopm_base_pa = iopm_base; + control->msrpm_base_pa = msrpm_base; + rdtscll(tsc); + control->tsc_offset = -tsc; + control->int_ctl = V_INTR_MASKING_MASK; + + init_seg(&save->es); + init_seg(&save->ss); + init_seg(&save->ds); + init_seg(&save->fs); + init_seg(&save->gs); + + save->cs.selector = 0xf000; + /* Executable/Readable Code Segment */ + save->cs.attrib = SVM_SELECTOR_READ_MASK | SVM_SELECTOR_P_MASK | + SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK; + save->cs.limit = 0xffff; + save->cs.base = 0xffff0000; + + save->gdtr.limit = 0xffff; + save->idtr.limit = 0xffff; + + init_sys_seg(&save->ldtr, SEG_TYPE_LDT); + init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16); + + save->efer = MSR_EFER_SVME_MASK; + + save->dr6 = 0xffff0ff0; + save->dr7 = 0x400; + save->rflags = 2; + save->rip = 0x0000fff0; + + /* + * cr0 val on cpu init should be 0x60000010, we enable cpu + * cache by default. the orderly way is to enable cache in bios. + */ + save->cr0 = 0x00000010 | CR0_PG_MASK; + save->cr4 = CR4_PAE_MASK; + /* rdx = ?? */ +} + +static int svm_create_vcpu(struct kvm_vcpu *vcpu) +{ + struct page *page; + int r; + + r = -ENOMEM; + vcpu->svm = kzalloc(sizeof *vcpu->svm, GFP_KERNEL); + if (!vcpu->svm) + goto out1; + page = alloc_page(GFP_KERNEL); + if (!page) + goto out2; + + vcpu->svm->vmcb = page_address(page); + memset(vcpu->svm->vmcb, 0, PAGE_SIZE); + vcpu->svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT; + vcpu->svm->cr0 = 0x00000010; + vcpu->svm->asid_generation = 0; + memset(vcpu->svm->db_regs, 0, sizeof(vcpu->svm->db_regs)); + init_vmcb(vcpu->svm->vmcb); + + return 0; + +out2: + kfree(vcpu->svm); +out1: + return r; +} + +static void svm_free_vcpu(struct kvm_vcpu *vcpu) +{ + if (!vcpu->svm) + return; + if (vcpu->svm->vmcb) + __free_page(pfn_to_page(vcpu->svm->vmcb_pa >> PAGE_SHIFT)); + kfree(vcpu->svm); +} + +static struct kvm_vcpu *svm_vcpu_load(struct kvm_vcpu *vcpu) +{ + get_cpu(); + return vcpu; +} + +static void svm_vcpu_put(struct kvm_vcpu *vcpu) +{ + put_cpu(); +} + +static void svm_cache_regs(struct kvm_vcpu *vcpu) +{ + vcpu->regs[VCPU_REGS_RAX] = vcpu->svm->vmcb->save.rax; + vcpu->regs[VCPU_REGS_RSP] = vcpu->svm->vmcb->save.rsp; + vcpu->rip = vcpu->svm->vmcb->save.rip; +} + +static void svm_decache_regs(struct kvm_vcpu *vcpu) +{ + vcpu->svm->vmcb->save.rax = vcpu->regs[VCPU_REGS_RAX]; + vcpu->svm->vmcb->save.rsp = vcpu->regs[VCPU_REGS_RSP]; + vcpu->svm->vmcb->save.rip = vcpu->rip; +} + +static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) +{ + return vcpu->svm->vmcb->save.rflags; +} + +static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) +{ + vcpu->svm->vmcb->save.rflags = rflags; +} + +static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg) +{ + struct vmcb_save_area *save = &vcpu->svm->vmcb->save; + + switch (seg) { + case VCPU_SREG_CS: return &save->cs; + case VCPU_SREG_DS: return &save->ds; + case VCPU_SREG_ES: return &save->es; + case VCPU_SREG_FS: return &save->fs; + case VCPU_SREG_GS: return &save->gs; + case VCPU_SREG_SS: return &save->ss; + case VCPU_SREG_TR: return &save->tr; + case VCPU_SREG_LDTR: return &save->ldtr; + } + BUG(); + return 0; +} + +static u64 svm_get_segment_base(struct kvm_vcpu *vcpu, int seg) +{ + struct vmcb_seg *s = svm_seg(vcpu, seg); + + return s->base; +} + +static void svm_get_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct vmcb_seg *s = svm_seg(vcpu, seg); + + var->base = s->base; + var->limit = s->limit; + var->selector = s->selector; + var->type = s->attrib & SVM_SELECTOR_TYPE_MASK; + var->s = (s->attrib >> SVM_SELECTOR_S_SHIFT) & 1; + var->dpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3; + var->present = (s->attrib >> SVM_SELECTOR_P_SHIFT) & 1; + var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1; + var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1; + var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1; + var->g = (s->attrib >> SVM_SELECTOR_G_SHIFT) & 1; + var->unusable = !var->present; +} + +static void svm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) +{ + struct vmcb_seg *s = svm_seg(vcpu, VCPU_SREG_CS); + + *db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1; + *l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1; +} + +static void svm_get_idt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + dt->limit = vcpu->svm->vmcb->save.ldtr.limit; + dt->base = vcpu->svm->vmcb->save.ldtr.base; +} + +static void svm_set_idt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + vcpu->svm->vmcb->save.ldtr.limit = dt->limit; + vcpu->svm->vmcb->save.ldtr.base = dt->base ; +} + +static void svm_get_gdt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + dt->limit = vcpu->svm->vmcb->save.gdtr.limit; + dt->base = vcpu->svm->vmcb->save.gdtr.base; +} + +static void svm_set_gdt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + vcpu->svm->vmcb->save.gdtr.limit = dt->limit; + vcpu->svm->vmcb->save.gdtr.base = dt->base ; +} + +static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) +{ +#ifdef __x86_64__ + if (vcpu->shadow_efer & EFER_LME) { + if (!is_paging(vcpu) && (cr0 & CR0_PG_MASK)) { + vcpu->shadow_efer |= EFER_LMA; + vcpu->svm->vmcb->save.efer |= EFER_LMA | EFER_LME; + } + + if (is_paging(vcpu) && !(cr0 & CR0_PG_MASK) ) { + vcpu->shadow_efer &= ~EFER_LMA; + vcpu->svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME); + } + } +#endif + vcpu->svm->cr0 = cr0; + vcpu->svm->vmcb->save.cr0 = cr0 | CR0_PG_MASK; + vcpu->cr0 = cr0; +} + +static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +{ + vcpu->cr4 = cr4; + vcpu->svm->vmcb->save.cr4 = cr4 | CR4_PAE_MASK; +} + +static void svm_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct vmcb_seg *s = svm_seg(vcpu, seg); + + s->base = var->base; + s->limit = var->limit; + s->selector = var->selector; + if (var->unusable) + s->attrib = 0; + else { + s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK); + s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT; + s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT; + s->attrib |= (var->present & 1) << SVM_SELECTOR_P_SHIFT; + s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT; + s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT; + s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT; + s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT; + } + if (seg == VCPU_SREG_CS) + vcpu->svm->vmcb->save.cpl + = (vcpu->svm->vmcb->save.cs.attrib + >> SVM_SELECTOR_DPL_SHIFT) & 3; + +} + +/* FIXME: + + vcpu->svm->vmcb->control.int_ctl &= ~V_TPR_MASK; + vcpu->svm->vmcb->control.int_ctl |= (sregs->cr8 & V_TPR_MASK); + +*/ + +static int svm_guest_debug(struct kvm_vcpu *vcpu, struct kvm_debug_guest *dbg) +{ + return -EOPNOTSUPP; +} + +static void load_host_msrs(struct kvm_vcpu *vcpu) +{ + int i; + + for ( i = 0; i < NR_HOST_SAVE_MSRS; i++) + wrmsrl(host_save_msrs[i], vcpu->svm->host_msrs[i]); +} + +static void save_host_msrs(struct kvm_vcpu *vcpu) +{ + int i; + + for ( i = 0; i < NR_HOST_SAVE_MSRS; i++) + rdmsrl(host_save_msrs[i], vcpu->svm->host_msrs[i]); +} + +static void new_asid(struct kvm_vcpu *vcpu, struct svm_cpu_data *svm_data) +{ + if (svm_data->next_asid > svm_data->max_asid) { + ++svm_data->asid_generation; + svm_data->next_asid = 1; + vcpu->svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID; + } + + vcpu->cpu = svm_data->cpu; + vcpu->svm->asid_generation = svm_data->asid_generation; + vcpu->svm->vmcb->control.asid = svm_data->next_asid++; +} + +static void svm_invlpg(struct kvm_vcpu *vcpu, gva_t address) +{ + invlpga(address, vcpu->svm->vmcb->control.asid); // is needed? +} + +static unsigned long svm_get_dr(struct kvm_vcpu *vcpu, int dr) +{ + return vcpu->svm->db_regs[dr]; +} + +static void svm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long value, + int *exception) +{ + *exception = 0; + + if (vcpu->svm->vmcb->save.dr7 & DR7_GD_MASK) { + vcpu->svm->vmcb->save.dr7 &= ~DR7_GD_MASK; + vcpu->svm->vmcb->save.dr6 |= DR6_BD_MASK; + *exception = DB_VECTOR; + return; + } + + switch (dr) { + case 0 ... 3: + vcpu->svm->db_regs[dr] = value; + return; + case 4 ... 5: + if (vcpu->cr4 & CR4_DE_MASK) { + *exception = UD_VECTOR; + return; + } + case 7: { + if (value & ~((1ULL << 32) - 1)) { + *exception = GP_VECTOR; + return; + } + vcpu->svm->vmcb->save.dr7 = value; + return; + } + default: + printk(KERN_DEBUG "%s: unexpected dr %u\n", + __FUNCTION__, dr); + *exception = UD_VECTOR; + return; + } +} + +static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 exit_int_info = vcpu->svm->vmcb->control.exit_int_info; + u64 fault_address; + u32 error_code; + enum emulation_result er; + + if (is_external_interrupt(exit_int_info)) + push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK); + + spin_lock(&vcpu->kvm->lock); + + fault_address = vcpu->svm->vmcb->control.exit_info_2; + error_code = vcpu->svm->vmcb->control.exit_info_1; + if (!vcpu->mmu.page_fault(vcpu, fault_address, error_code)) { + spin_unlock(&vcpu->kvm->lock); + return 1; + } + er = emulate_instruction(vcpu, kvm_run, fault_address, error_code); + spin_unlock(&vcpu->kvm->lock); + + switch (er) { + case EMULATE_DONE: + return 1; + case EMULATE_DO_MMIO: + ++kvm_stat.mmio_exits; + kvm_run->exit_reason = KVM_EXIT_MMIO; + return 0; + case EMULATE_FAIL: + vcpu_printf(vcpu, "%s: emulate fail\n", __FUNCTION__); + break; + default: + BUG(); + } + + kvm_run->exit_reason = KVM_EXIT_UNKNOWN; + return 0; +} + +static int io_get_override(struct kvm_vcpu *vcpu, + struct vmcb_seg **seg, + int *addr_override) +{ + u8 inst[MAX_INST_SIZE]; + unsigned ins_length; + gva_t rip; + int i; + + rip = vcpu->svm->vmcb->save.rip; + ins_length = vcpu->svm->next_rip - rip; + rip += vcpu->svm->vmcb->save.cs.base; + + if (ins_length > MAX_INST_SIZE) + printk(KERN_DEBUG + "%s: inst length err, cs base 0x%llx rip 0x%llx " + "next rip 0x%llx ins_length %u\n", + __FUNCTION__, + vcpu->svm->vmcb->save.cs.base, + vcpu->svm->vmcb->save.rip, + vcpu->svm->vmcb->control.exit_info_2, + ins_length); + + if (kvm_read_guest(vcpu, rip, ins_length, inst) != ins_length) + /* #PF */ + return 0; + + *addr_override = 0; + *seg = 0; + for (i = 0; i < ins_length; i++) + switch (inst[i]) { + case 0xf0: + case 0xf2: + case 0xf3: + case 0x66: + continue; + case 0x67: + *addr_override = 1; + continue; + case 0x2e: + *seg = &vcpu->svm->vmcb->save.cs; + continue; + case 0x36: + *seg = &vcpu->svm->vmcb->save.ss; + continue; + case 0x3e: + *seg = &vcpu->svm->vmcb->save.ds; + continue; + case 0x26: + *seg = &vcpu->svm->vmcb->save.es; + continue; + case 0x64: + *seg = &vcpu->svm->vmcb->save.fs; + continue; + case 0x65: + *seg = &vcpu->svm->vmcb->save.gs; + continue; + default: + return 1; + } + printk(KERN_DEBUG "%s: unexpected\n", __FUNCTION__); + return 0; +} + +static unsigned long io_adress(struct kvm_vcpu *vcpu, int ins, u64 *address) +{ + unsigned long addr_mask; + unsigned long *reg; + struct vmcb_seg *seg; + int addr_override; + struct vmcb_save_area *save_area = &vcpu->svm->vmcb->save; + u16 cs_attrib = save_area->cs.attrib; + unsigned addr_size = get_addr_size(vcpu); + + if (!io_get_override(vcpu, &seg, &addr_override)) + return 0; + + if (addr_override) + addr_size = (addr_size == 2) ? 4: (addr_size >> 1); + + if (ins) { + reg = &vcpu->regs[VCPU_REGS_RDI]; + seg = &vcpu->svm->vmcb->save.es; + } else { + reg = &vcpu->regs[VCPU_REGS_RSI]; + seg = (seg) ? seg : &vcpu->svm->vmcb->save.ds; + } + + addr_mask = ~0ULL >> (64 - (addr_size * 8)); + + if ((cs_attrib & SVM_SELECTOR_L_MASK) && + !(vcpu->svm->vmcb->save.rflags & X86_EFLAGS_VM)) { + *address = (*reg & addr_mask); + return addr_mask; + } + + if (!(seg->attrib & SVM_SELECTOR_P_SHIFT)) { + svm_inject_gp(vcpu, 0); + return 0; + } + + *address = (*reg & addr_mask) + seg->base; + return addr_mask; +} + +static int io_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 io_info = vcpu->svm->vmcb->control.exit_info_1; //address size bug? + int _in = io_info & SVM_IOIO_TYPE_MASK; + + ++kvm_stat.io_exits; + + vcpu->svm->next_rip = vcpu->svm->vmcb->control.exit_info_2; + + kvm_run->exit_reason = KVM_EXIT_IO; + kvm_run->io.port = io_info >> 16; + kvm_run->io.direction = (_in) ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT; + kvm_run->io.size = ((io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT); + kvm_run->io.string = (io_info & SVM_IOIO_STR_MASK) != 0; + kvm_run->io.rep = (io_info & SVM_IOIO_REP_MASK) != 0; + + if (kvm_run->io.string) { + unsigned addr_mask; + + addr_mask = io_adress(vcpu, _in, &kvm_run->io.address); + if (!addr_mask) { + printk("%s: get io address failed\n", __FUNCTION__); + return 1; + } + + if (kvm_run->io.rep) { + kvm_run->io.count = vcpu->regs[VCPU_REGS_RCX] & addr_mask; + kvm_run->io.string_down = (vcpu->svm->vmcb->save.rflags + & X86_EFLAGS_DF) != 0; + } + } else { + kvm_run->io.value = vcpu->svm->vmcb->save.rax; + } + return 0; +} + + +static int nop_on_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + return 1; +} + +static int halt_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 1; + skip_emulated_instruction(vcpu); + if (vcpu->irq_summary && (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF)) + return 1; + + kvm_run->exit_reason = KVM_EXIT_HLT; + return 0; +} + +static int invalid_op_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + inject_ud(vcpu); + return 1; +} + +static int task_switch_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + printk("%s: task swiche is unsupported\n", __FUNCTION__); + kvm_run->exit_reason = KVM_EXIT_UNKNOWN; + return 0; +} + +static int cpuid_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 2; + kvm_run->exit_reason = KVM_EXIT_CPUID; + return 0; +} + +static int emulate_on_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + if (emulate_instruction(vcpu, 0, 0, 0) != EMULATE_DONE) + printk(KERN_ERR "%s: failed\n", __FUNCTION__); + return 1; +} + +static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data) +{ + switch (ecx) { + case MSR_IA32_MC0_CTL: + case MSR_IA32_MCG_STATUS: + case MSR_IA32_MCG_CAP: + case MSR_IA32_MC0_MISC: + case MSR_IA32_MC0_MISC+4: + case MSR_IA32_MC0_MISC+8: + case MSR_IA32_MC0_MISC+12: + case MSR_IA32_MC0_MISC+16: + case MSR_IA32_UCODE_REV: + /* MTRR registers */ + case 0xfe: + case 0x200 ... 0x2ff: + *data = 0; + break; + case MSR_IA32_TIME_STAMP_COUNTER: { + u64 tsc; + + rdtscll(tsc); + *data = vcpu->svm->vmcb->control.tsc_offset + tsc; + break; + } + case MSR_EFER: + *data = vcpu->shadow_efer; + break; + case MSR_IA32_APICBASE: + *data = vcpu->apic_base; + break; + case MSR_STAR: + *data = vcpu->svm->vmcb->save.star; + break; + case MSR_LSTAR: + *data = vcpu->svm->vmcb->save.lstar; + break; + case MSR_CSTAR: + *data = vcpu->svm->vmcb->save.cstar; + break; + case MSR_IA32_SYSENTER_CS: + *data = vcpu->svm->vmcb->save.sysenter_cs; + break; + case MSR_IA32_SYSENTER_EIP: + *data = vcpu->svm->vmcb->save.sysenter_eip; + break; + case MSR_IA32_SYSENTER_ESP: + *data = vcpu->svm->vmcb->save.sysenter_esp; + break; + case MSR_KERNEL_GS_BASE: + *data = vcpu->svm->vmcb->save.kernel_gs_base; + break; + case MSR_SYSCALL_MASK: + *data = vcpu->svm->vmcb->save.sfmask; + break; + default: + printk(KERN_ERR "kvm: unhandled rdmsr: 0x%x\n", ecx); + return 1; + } + return 0; +} + +static int rdmsr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 ecx = vcpu->regs[VCPU_REGS_RCX]; + u64 data; + + if (svm_get_msr(vcpu, ecx, &data)) + svm_inject_gp(vcpu, 0); + else { + vcpu->svm->vmcb->save.rax = data & 0xffffffff; + vcpu->regs[VCPU_REGS_RDX] = data >> 32; + vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 2; + skip_emulated_instruction(vcpu); + } + return 1; +} + +static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) +{ + switch (ecx) { +#ifdef __x86_64__ + case MSR_EFER: + set_efer(vcpu, data); + break; +#endif + case MSR_IA32_MC0_STATUS: + printk(KERN_WARNING "%s: MSR_IA32_MC0_STATUS 0x%llx, nop\n" + , __FUNCTION__, data); + break; + case MSR_IA32_TIME_STAMP_COUNTER: { + u64 tsc; + + rdtscll(tsc); + vcpu->svm->vmcb->control.tsc_offset = data - tsc; + break; + } + case MSR_IA32_UCODE_REV: + case MSR_IA32_UCODE_WRITE: + case 0x200 ... 0x2ff: /* MTRRs */ + break; + case MSR_IA32_APICBASE: + vcpu->apic_base = data; + break; + case MSR_STAR: + vcpu->svm->vmcb->save.star = data; + break; + case MSR_LSTAR: + vcpu->svm->vmcb->save.lstar = data; + break; + case MSR_CSTAR: + vcpu->svm->vmcb->save.cstar = data; + break; + case MSR_IA32_SYSENTER_CS: + vcpu->svm->vmcb->save.sysenter_cs = data; + break; + case MSR_IA32_SYSENTER_EIP: + vcpu->svm->vmcb->save.sysenter_eip = data; + break; + case MSR_IA32_SYSENTER_ESP: + vcpu->svm->vmcb->save.sysenter_esp = data; + break; + case MSR_KERNEL_GS_BASE: + vcpu->svm->vmcb->save.kernel_gs_base = data; + break; + case MSR_SYSCALL_MASK: + vcpu->svm->vmcb->save.sfmask = data; + break; + default: + printk(KERN_ERR "kvm: unhandled wrmsr: %x\n", ecx); + return 1; + } + return 0; +} + +static int wrmsr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 ecx = vcpu->regs[VCPU_REGS_RCX]; + u64 data = (vcpu->svm->vmcb->save.rax & -1u) + | ((u64)(vcpu->regs[VCPU_REGS_RDX] & -1u) << 32); + vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 2; + if (svm_set_msr(vcpu, ecx, data)) + svm_inject_gp(vcpu, 0); + else + skip_emulated_instruction(vcpu); + return 1; +} + +static int msr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + if (vcpu->svm->vmcb->control.exit_info_1) + return wrmsr_interception(vcpu, kvm_run); + else + return rdmsr_interception(vcpu, kvm_run); +} + +static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) = { + [SVM_EXIT_READ_CR0] = emulate_on_interception, + [SVM_EXIT_READ_CR3] = emulate_on_interception, + [SVM_EXIT_READ_CR4] = emulate_on_interception, + /* for now: */ + [SVM_EXIT_WRITE_CR0] = emulate_on_interception, + [SVM_EXIT_WRITE_CR3] = emulate_on_interception, + [SVM_EXIT_WRITE_CR4] = emulate_on_interception, + [SVM_EXIT_READ_DR0] = emulate_on_interception, + [SVM_EXIT_READ_DR1] = emulate_on_interception, + [SVM_EXIT_READ_DR2] = emulate_on_interception, + [SVM_EXIT_READ_DR3] = emulate_on_interception, + [SVM_EXIT_WRITE_DR0] = emulate_on_interception, + [SVM_EXIT_WRITE_DR1] = emulate_on_interception, + [SVM_EXIT_WRITE_DR2] = emulate_on_interception, + [SVM_EXIT_WRITE_DR3] = emulate_on_interception, + [SVM_EXIT_WRITE_DR5] = emulate_on_interception, + [SVM_EXIT_WRITE_DR7] = emulate_on_interception, + [SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception, + [SVM_EXIT_INTR] = nop_on_interception, + [SVM_EXIT_NMI] = nop_on_interception, + [SVM_EXIT_SMI] = nop_on_interception, + [SVM_EXIT_INIT] = nop_on_interception, + /* [SVM_EXIT_CR0_SEL_WRITE] = emulate_on_interception, */ + [SVM_EXIT_CPUID] = cpuid_interception, + [SVM_EXIT_HLT] = halt_interception, + [SVM_EXIT_INVLPG] = emulate_on_interception, + [SVM_EXIT_INVLPGA] = invalid_op_interception, + [SVM_EXIT_IOIO] = io_interception, + [SVM_EXIT_MSR] = msr_interception, + [SVM_EXIT_TASK_SWITCH] = task_switch_interception, + [SVM_EXIT_VMRUN] = invalid_op_interception, + [SVM_EXIT_VMMCALL] = invalid_op_interception, + [SVM_EXIT_VMLOAD] = invalid_op_interception, + [SVM_EXIT_VMSAVE] = invalid_op_interception, + [SVM_EXIT_STGI] = invalid_op_interception, + [SVM_EXIT_CLGI] = invalid_op_interception, + [SVM_EXIT_SKINIT] = invalid_op_interception, +}; + + +static int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 exit_code = vcpu->svm->vmcb->control.exit_code; + + kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT; + + if (is_external_interrupt(vcpu->svm->vmcb->control.exit_int_info) && + exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR) + printk(KERN_ERR "%s: unexpected exit_ini_info 0x%x " + "exit_code 0x%x\n", + __FUNCTION__, vcpu->svm->vmcb->control.exit_int_info, + exit_code); + + if (exit_code >= sizeof(svm_exit_handlers) / sizeof(*svm_exit_handlers) + || svm_exit_handlers[exit_code] == 0) { + kvm_run->exit_reason = KVM_EXIT_UNKNOWN; + printk(KERN_ERR "%s: 0x%x @ 0x%llx cr0 0x%lx rflags 0x%llx\n", + __FUNCTION__, + exit_code, + vcpu->svm->vmcb->save.rip, + vcpu->cr0, + vcpu->svm->vmcb->save.rflags); + return 0; + } + + return svm_exit_handlers[exit_code](vcpu, kvm_run); +} + +static void reload_tss(struct kvm_vcpu *vcpu) +{ + int cpu = raw_smp_processor_id(); + + struct svm_cpu_data *svm_data = per_cpu(svm_data, cpu); + svm_data->tss_desc->type = 9; //available 32/64-bit TSS + load_TR_desc(); +} + +static void pre_svm_run(struct kvm_vcpu *vcpu) +{ + int cpu = raw_smp_processor_id(); + + struct svm_cpu_data *svm_data = per_cpu(svm_data, cpu); + + vcpu->svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; + if (vcpu->cpu != cpu || + vcpu->svm->asid_generation != svm_data->asid_generation) + new_asid(vcpu, svm_data); +} + + +static inline void kvm_try_inject_irq(struct kvm_vcpu *vcpu) +{ + struct vmcb_control_area *control; + + if (!vcpu->irq_summary) + return; + + control = &vcpu->svm->vmcb->control; + + control->int_vector = pop_irq(vcpu); + control->int_ctl &= ~V_INTR_PRIO_MASK; + control->int_ctl |= V_IRQ_MASK | + ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT); +} + +static void kvm_reput_irq(struct kvm_vcpu *vcpu) +{ + struct vmcb_control_area *control = &vcpu->svm->vmcb->control; + + if (control->int_ctl & V_IRQ_MASK) { + control->int_ctl &= ~V_IRQ_MASK; + push_irq(vcpu, control->int_vector); + } +} + +static void save_db_regs(unsigned long *db_regs) +{ + asm ("mov %%dr0, %%rax \n\t" + "mov %%rax, %[dr0] \n\t" + "mov %%dr1, %%rax \n\t" + "mov %%rax, %[dr1] \n\t" + "mov %%dr2, %%rax \n\t" + "mov %%rax, %[dr2] \n\t" + "mov %%dr3, %%rax \n\t" + "mov %%rax, %[dr3] \n\t" + : [dr0] "=m"(db_regs[0]), + [dr1] "=m"(db_regs[1]), + [dr2] "=m"(db_regs[2]), + [dr3] "=m"(db_regs[3]) + : : "rax"); +} + +static void load_db_regs(unsigned long *db_regs) +{ + asm volatile ("mov %[dr0], %%dr0 \n\t" + "mov %[dr1], %%dr1 \n\t" + "mov %[dr2], %%dr2 \n\t" + "mov %[dr3], %%dr3 \n\t" + : + : [dr0] "r"(db_regs[0]), + [dr1] "r"(db_regs[1]), + [dr2] "r"(db_regs[2]), + [dr3] "r"(db_regs[3]) + : "rax"); +} + +static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u16 fs_selector; + u16 gs_selector; + u16 ldt_selector; + +again: + kvm_try_inject_irq(vcpu); + + clgi(); + + pre_svm_run(vcpu); + + save_host_msrs(vcpu); + fs_selector = read_fs(); + gs_selector = read_gs(); + ldt_selector = read_ldt(); + vcpu->svm->host_cr2 = kvm_read_cr2(); + vcpu->svm->host_dr6 = read_dr6(); + vcpu->svm->host_dr7 = read_dr7(); + vcpu->svm->vmcb->save.cr2 = vcpu->cr2; + + if (vcpu->svm->vmcb->save.dr7 & 0xff) { + write_dr7(0); + save_db_regs(vcpu->svm->host_db_regs); + load_db_regs(vcpu->svm->db_regs); + } + asm volatile ( +#ifdef __x86_64__ + "push %%rbx; push %%rcx; push %%rdx;" + "push %%rsi; push %%rdi; push %%rbp;" + "push %%r8; push %%r9; push %%r10; push %%r11;" + "push %%r12; push %%r13; push %%r14; push %%r15;" +#else + "push %%ebx; push %%ecx; push %%edx;" + "push %%esi; push %%edi; push %%ebp;" +#endif + +#ifdef __x86_64__ + "mov %c[rbx](%[vcpu]), %%rbx \n\t" + "mov %c[rcx](%[vcpu]), %%rcx \n\t" + "mov %c[rdx](%[vcpu]), %%rdx \n\t" + "mov %c[rsi](%[vcpu]), %%rsi \n\t" + "mov %c[rdi](%[vcpu]), %%rdi \n\t" + "mov %c[rbp](%[vcpu]), %%rbp \n\t" + "mov %c[r8](%[vcpu]), %%r8 \n\t" + "mov %c[r9](%[vcpu]), %%r9 \n\t" + "mov %c[r10](%[vcpu]), %%r10 \n\t" + "mov %c[r11](%[vcpu]), %%r11 \n\t" + "mov %c[r12](%[vcpu]), %%r12 \n\t" + "mov %c[r13](%[vcpu]), %%r13 \n\t" + "mov %c[r14](%[vcpu]), %%r14 \n\t" + "mov %c[r15](%[vcpu]), %%r15 \n\t" +#else + "mov %c[rbx](%[vcpu]), %%ebx \n\t" + "mov %c[rcx](%[vcpu]), %%ecx \n\t" + "mov %c[rdx](%[vcpu]), %%edx \n\t" + "mov %c[rsi](%[vcpu]), %%esi \n\t" + "mov %c[rdi](%[vcpu]), %%edi \n\t" + "mov %c[rbp](%[vcpu]), %%ebp \n\t" +#endif + + /* Enter guest mode */ + "push %%rax \n\t" + "mov %c[svm](%[vcpu]), %%rax \n\t" + "mov %c[vmcb](%%rax), %%rax \n\t" + SVM_VMLOAD "\n\t" + SVM_VMRUN "\n\t" + SVM_VMSAVE "\n\t" + "pop %%rax \n\t" + + /* Save guest registers, load host registers */ +#ifdef __x86_64__ + "mov %%rbx, %c[rbx](%[vcpu]) \n\t" + "mov %%rcx, %c[rcx](%[vcpu]) \n\t" + "mov %%rdx, %c[rdx](%[vcpu]) \n\t" + "mov %%rsi, %c[rsi](%[vcpu]) \n\t" + "mov %%rdi, %c[rdi](%[vcpu]) \n\t" + "mov %%rbp, %c[rbp](%[vcpu]) \n\t" + "mov %%r8, %c[r8](%[vcpu]) \n\t" + "mov %%r9, %c[r9](%[vcpu]) \n\t" + "mov %%r10, %c[r10](%[vcpu]) \n\t" + "mov %%r11, %c[r11](%[vcpu]) \n\t" + "mov %%r12, %c[r12](%[vcpu]) \n\t" + "mov %%r13, %c[r13](%[vcpu]) \n\t" + "mov %%r14, %c[r14](%[vcpu]) \n\t" + "mov %%r15, %c[r15](%[vcpu]) \n\t" + + "pop %%r15; pop %%r14; pop %%r13; pop %%r12;" + "pop %%r11; pop %%r10; pop %%r9; pop %%r8;" + "pop %%rbp; pop %%rdi; pop %%rsi;" + "pop %%rdx; pop %%rcx; pop %%rbx; \n\t" +#else + "mov %%ebx, %c[rbx](%3) \n\t" + "mov %%ecx, %c[rcx](%3) \n\t" + "mov %%edx, %c[rdx](%3) \n\t" + "mov %%esi, %c[rsi](%3) \n\t" + "mov %%edi, %c[rdi](%3) \n\t" + "mov %%ebp, %c[rbp](%3) \n\t" + + "pop %%ebp; pop %%edi; pop %%esi;" + "pop %%edx; pop %%ecx; pop %%ebx; \n\t" +#endif + : + : [vcpu]"a"(vcpu), + [svm]"i"(offsetof(struct kvm_vcpu, svm)), + [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)), + [rbx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBX])), + [rcx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RCX])), + [rdx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDX])), + [rsi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RSI])), + [rdi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDI])), + [rbp]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBP])) +#ifdef __x86_64__ + ,[r8 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R8 ])), + [r9 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R9 ])), + [r10]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R10])), + [r11]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R11])), + [r12]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R12])), + [r13]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R13])), + [r14]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R14])), + [r15]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R15])) +#endif + : "cc", "memory" ); + + if ((vcpu->svm->vmcb->save.dr7 & 0xff)) + load_db_regs(vcpu->svm->host_db_regs); + + vcpu->cr2 = vcpu->svm->vmcb->save.cr2; + + write_dr6(vcpu->svm->host_dr6); + write_dr7(vcpu->svm->host_dr7); + kvm_write_cr2(vcpu->svm->host_cr2); + + load_fs(fs_selector); + load_gs(gs_selector); + load_ldt(ldt_selector); + load_host_msrs(vcpu); + + reload_tss(vcpu); + + stgi(); + + kvm_reput_irq(vcpu); + + vcpu->svm->next_rip = 0; + + if (vcpu->svm->vmcb->control.exit_code == SVM_EXIT_ERR) { + kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY; + kvm_run->exit_reason = vcpu->svm->vmcb->control.exit_code; + return 0; + } + + if (handle_exit(vcpu, kvm_run)) { + if (signal_pending(current)) { + ++kvm_stat.signal_exits; + return -EINTR; + } + kvm_resched(vcpu); + goto again; + } + return 0; +} + +static void svm_flush_tlb(struct kvm_vcpu *vcpu) +{ + force_new_asid(vcpu); +} + +static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root) +{ + vcpu->svm->vmcb->save.cr3 = root; + force_new_asid(vcpu); +} + +static void svm_inject_page_fault(struct kvm_vcpu *vcpu, + unsigned long addr, + uint32_t err_code) +{ + uint32_t exit_int_info = vcpu->svm->vmcb->control.exit_int_info; + + ++kvm_stat.pf_guest; + + if (is_page_fault(exit_int_info)) { + + vcpu->svm->vmcb->control.event_inj_err = 0; + vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_VALID_ERR | + SVM_EVTINJ_TYPE_EXEPT | + DF_VECTOR; + return; + } + vcpu->cr2 = addr; + vcpu->svm->vmcb->save.cr2 = addr; + vcpu->svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | + SVM_EVTINJ_VALID_ERR | + SVM_EVTINJ_TYPE_EXEPT | + PF_VECTOR; + vcpu->svm->vmcb->control.event_inj_err = err_code; +} + + +static int is_disabled(void) +{ + return 0; +} + +static struct kvm_arch_ops svm_arch_ops = { + .cpu_has_kvm_support = has_svm, + .disabled_by_bios = is_disabled, + .hardware_setup = svm_hardware_setup, + .hardware_unsetup = svm_hardware_unsetup, + .hardware_enable = svm_hardware_enable, + .hardware_disable = svm_hardware_disable, + + .vcpu_create = svm_create_vcpu, + .vcpu_free = svm_free_vcpu, + + .vcpu_load = svm_vcpu_load, + .vcpu_put = svm_vcpu_put, + + .set_guest_debug = svm_guest_debug, + .get_msr = svm_get_msr, + .set_msr = svm_set_msr, + .get_segment_base = svm_get_segment_base, + .get_segment = svm_get_segment, + .set_segment = svm_set_segment, + .is_long_mode = svm_is_long_mode, + .get_cs_db_l_bits = svm_get_cs_db_l_bits, + .set_cr0 = svm_set_cr0, + .set_cr0_no_modeswitch = svm_set_cr0, + .set_cr3 = svm_set_cr3, + .set_cr4 = svm_set_cr4, + .set_efer = svm_set_efer, + .get_idt = svm_get_idt, + .set_idt = svm_set_idt, + .get_gdt = svm_get_gdt, + .set_gdt = svm_set_gdt, + .get_dr = svm_get_dr, + .set_dr = svm_set_dr, + .cache_regs = svm_cache_regs, + .decache_regs = svm_decache_regs, + .get_rflags = svm_get_rflags, + .set_rflags = svm_set_rflags, + + .invlpg = svm_invlpg, + .tlb_flush = svm_flush_tlb, + .inject_page_fault = svm_inject_page_fault, + + .inject_gp = svm_inject_gp, + + .run = svm_vcpu_run, + .skip_emulated_instruction = skip_emulated_instruction, + .vcpu_setup = svm_vcpu_setup, +}; + +static int __init svm_init(void) +{ + int res = kvm_init_arch(&svm_arch_ops, THIS_MODULE); + + if (!res) + kvm_emulator_want_group7_invlpg(); + + return res; +} + +static void __exit svm_exit(void) +{ + kvm_exit_arch(); +} + +module_init(svm_init) +module_exit(svm_exit) Index: linux/drivers/kvm/svm.h =================================================================== --- /dev/null +++ linux/drivers/kvm/svm.h @@ -0,0 +1,315 @@ +#ifndef __SVM_H +#define __SVM_H + +enum { + INTERCEPT_INTR, + INTERCEPT_NMI, + INTERCEPT_SMI, + INTERCEPT_INIT, + INTERCEPT_VINTR, + INTERCEPT_SELECTIVE_CR0, + INTERCEPT_STORE_IDTR, + INTERCEPT_STORE_GDTR, + INTERCEPT_STORE_LDTR, + INTERCEPT_STORE_TR, + INTERCEPT_LOAD_IDTR, + INTERCEPT_LOAD_GDTR, + INTERCEPT_LOAD_LDTR, + INTERCEPT_LOAD_TR, + INTERCEPT_RDTSC, + INTERCEPT_RDPMC, + INTERCEPT_PUSHF, + INTERCEPT_POPF, + INTERCEPT_CPUID, + INTERCEPT_RSM, + INTERCEPT_IRET, + INTERCEPT_INTn, + INTERCEPT_INVD, + INTERCEPT_PAUSE, + INTERCEPT_HLT, + INTERCEPT_INVLPG, + INTERCEPT_INVLPGA, + INTERCEPT_IOIO_PROT, + INTERCEPT_MSR_PROT, + INTERCEPT_TASK_SWITCH, + INTERCEPT_FERR_FREEZE, + INTERCEPT_SHUTDOWN, + INTERCEPT_VMRUN, + INTERCEPT_VMMCALL, + INTERCEPT_VMLOAD, + INTERCEPT_VMSAVE, + INTERCEPT_STGI, + INTERCEPT_CLGI, + INTERCEPT_SKINIT, + INTERCEPT_RDTSCP, + INTERCEPT_ICEBP, + INTERCEPT_WBINVD, +}; + + +struct __attribute__ ((__packed__)) vmcb_control_area { + u16 intercept_cr_read; + u16 intercept_cr_write; + u16 intercept_dr_read; + u16 intercept_dr_write; + u32 intercept_exceptions; + u64 intercept; + u8 reserved_1[44]; + u64 iopm_base_pa; + u64 msrpm_base_pa; + u64 tsc_offset; + u32 asid; + u8 tlb_ctl; + u8 reserved_2[3]; + u32 int_ctl; + u32 int_vector; + u32 int_state; + u8 reserved_3[4]; + u32 exit_code; + u32 exit_code_hi; + u64 exit_info_1; + u64 exit_info_2; + u32 exit_int_info; + u32 exit_int_info_err; + u64 nested_ctl; + u8 reserved_4[16]; + u32 event_inj; + u32 event_inj_err; + u64 nested_cr3; + u64 lbr_ctl; + u8 reserved_5[832]; +}; + + +#define TLB_CONTROL_DO_NOTHING 0 +#define TLB_CONTROL_FLUSH_ALL_ASID 1 + +#define V_TPR_MASK 0x0f + +#define V_IRQ_SHIFT 8 +#define V_IRQ_MASK (1 << V_IRQ_SHIFT) + +#define V_INTR_PRIO_SHIFT 16 +#define V_INTR_PRIO_MASK (0x0f << V_INTR_PRIO_SHIFT) + +#define V_IGN_TPR_SHIFT 20 +#define V_IGN_TPR_MASK (1 << V_IGN_TPR_SHIFT) + +#define V_INTR_MASKING_SHIFT 24 +#define V_INTR_MASKING_MASK (1 << V_INTR_MASKING_SHIFT) + +#define SVM_INTERRUPT_SHADOW_MASK 1 + +#define SVM_IOIO_STR_SHIFT 2 +#define SVM_IOIO_REP_SHIFT 3 +#define SVM_IOIO_SIZE_SHIFT 4 +#define SVM_IOIO_ASIZE_SHIFT 7 + +#define SVM_IOIO_TYPE_MASK 1 +#define SVM_IOIO_STR_MASK (1 << SVM_IOIO_STR_SHIFT) +#define SVM_IOIO_REP_MASK (1 << SVM_IOIO_REP_SHIFT) +#define SVM_IOIO_SIZE_MASK (7 << SVM_IOIO_SIZE_SHIFT) +#define SVM_IOIO_ASIZE_MASK (7 << SVM_IOIO_ASIZE_SHIFT) + +struct __attribute__ ((__packed__)) vmcb_seg { + u16 selector; + u16 attrib; + u32 limit; + u64 base; +}; + +struct __attribute__ ((__packed__)) vmcb_save_area { + struct vmcb_seg es; + struct vmcb_seg cs; + struct vmcb_seg ss; + struct vmcb_seg ds; + struct vmcb_seg fs; + struct vmcb_seg gs; + struct vmcb_seg gdtr; + struct vmcb_seg ldtr; + struct vmcb_seg idtr; + struct vmcb_seg tr; + u8 reserved_1[43]; + u8 cpl; + u8 reserved_2[4]; + u64 efer; + u8 reserved_3[112]; + u64 cr4; + u64 cr3; + u64 cr0; + u64 dr7; + u64 dr6; + u64 rflags; + u64 rip; + u8 reserved_4[88]; + u64 rsp; + u8 reserved_5[24]; + u64 rax; + u64 star; + u64 lstar; + u64 cstar; + u64 sfmask; + u64 kernel_gs_base; + u64 sysenter_cs; + u64 sysenter_esp; + u64 sysenter_eip; + u64 cr2; + u8 reserved_6[32]; + u64 g_pat; + u64 dbgctl; + u64 br_from; + u64 br_to; + u64 last_excp_from; + u64 last_excp_to; +}; + +struct __attribute__ ((__packed__)) vmcb { + struct vmcb_control_area control; + struct vmcb_save_area save; +}; + +#define SVM_CPUID_FEATURE_SHIFT 2 +#define SVM_CPUID_FUNC 0x8000000a + +#define MSR_EFER_SVME_MASK (1ULL << 12) +#define MSR_VM_HSAVE_PA 0xc0010117ULL + +#define SVM_SELECTOR_S_SHIFT 4 +#define SVM_SELECTOR_DPL_SHIFT 5 +#define SVM_SELECTOR_P_SHIFT 7 +#define SVM_SELECTOR_AVL_SHIFT 8 +#define SVM_SELECTOR_L_SHIFT 9 +#define SVM_SELECTOR_DB_SHIFT 10 +#define SVM_SELECTOR_G_SHIFT 11 + +#define SVM_SELECTOR_TYPE_MASK (0xf) +#define SVM_SELECTOR_S_MASK (1 << SVM_SELECTOR_S_SHIFT) +#define SVM_SELECTOR_DPL_MASK (3 << SVM_SELECTOR_DPL_SHIFT) +#define SVM_SELECTOR_P_MASK (1 << SVM_SELECTOR_P_SHIFT) +#define SVM_SELECTOR_AVL_MASK (1 << SVM_SELECTOR_AVL_SHIFT) +#define SVM_SELECTOR_L_MASK (1 << SVM_SELECTOR_L_SHIFT) +#define SVM_SELECTOR_DB_MASK (1 << SVM_SELECTOR_DB_SHIFT) +#define SVM_SELECTOR_G_MASK (1 << SVM_SELECTOR_G_SHIFT) + +#define SVM_SELECTOR_WRITE_MASK (1 << 1) +#define SVM_SELECTOR_READ_MASK SVM_SELECTOR_WRITE_MASK +#define SVM_SELECTOR_CODE_MASK (1 << 3) + +#define INTERCEPT_CR0_MASK 1 +#define INTERCEPT_CR3_MASK (1 << 3) +#define INTERCEPT_CR4_MASK (1 << 4) + +#define INTERCEPT_DR0_MASK 1 +#define INTERCEPT_DR1_MASK (1 << 1) +#define INTERCEPT_DR2_MASK (1 << 2) +#define INTERCEPT_DR3_MASK (1 << 3) +#define INTERCEPT_DR4_MASK (1 << 4) +#define INTERCEPT_DR5_MASK (1 << 5) +#define INTERCEPT_DR6_MASK (1 << 6) +#define INTERCEPT_DR7_MASK (1 << 7) + +#define SVM_EVTINJ_VEC_MASK 0xff + +#define SVM_EVTINJ_TYPE_SHIFT 8 +#define SVM_EVTINJ_TYPE_MASK (7 << SVM_EVTINJ_TYPE_SHIFT) + +#define SVM_EVTINJ_TYPE_INTR (0 << SVM_EVTINJ_TYPE_SHIFT) +#define SVM_EVTINJ_TYPE_NMI (2 << SVM_EVTINJ_TYPE_SHIFT) +#define SVM_EVTINJ_TYPE_EXEPT (3 << SVM_EVTINJ_TYPE_SHIFT) +#define SVM_EVTINJ_TYPE_SOFT (4 << SVM_EVTINJ_TYPE_SHIFT) + +#define SVM_EVTINJ_VALID (1 << 31) +#define SVM_EVTINJ_VALID_ERR (1 << 11) + +#define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK + +#define SVM_EXITINTINFO_TYPE_INTR SVM_EVTINJ_TYPE_INTR +#define SVM_EXITINTINFO_TYPE_NMI SVM_EVTINJ_TYPE_NMI +#define SVM_EXITINTINFO_TYPE_EXEPT SVM_EVTINJ_TYPE_EXEPT +#define SVM_EXITINTINFO_TYPE_SOFT SVM_EVTINJ_TYPE_SOFT + +#define SVM_EXITINTINFO_VALID SVM_EVTINJ_VALID +#define SVM_EXITINTINFO_VALID_ERR SVM_EVTINJ_VALID_ERR + +#define SVM_EXIT_READ_CR0 0x000 +#define SVM_EXIT_READ_CR3 0x003 +#define SVM_EXIT_READ_CR4 0x004 +#define SVM_EXIT_READ_CR8 0x008 +#define SVM_EXIT_WRITE_CR0 0x010 +#define SVM_EXIT_WRITE_CR3 0x013 +#define SVM_EXIT_WRITE_CR4 0x014 +#define SVM_EXIT_WRITE_CR8 0x018 +#define SVM_EXIT_READ_DR0 0x020 +#define SVM_EXIT_READ_DR1 0x021 +#define SVM_EXIT_READ_DR2 0x022 +#define SVM_EXIT_READ_DR3 0x023 +#define SVM_EXIT_READ_DR4 0x024 +#define SVM_EXIT_READ_DR5 0x025 +#define SVM_EXIT_READ_DR6 0x026 +#define SVM_EXIT_READ_DR7 0x027 +#define SVM_EXIT_WRITE_DR0 0x030 +#define SVM_EXIT_WRITE_DR1 0x031 +#define SVM_EXIT_WRITE_DR2 0x032 +#define SVM_EXIT_WRITE_DR3 0x033 +#define SVM_EXIT_WRITE_DR4 0x034 +#define SVM_EXIT_WRITE_DR5 0x035 +#define SVM_EXIT_WRITE_DR6 0x036 +#define SVM_EXIT_WRITE_DR7 0x037 +#define SVM_EXIT_EXCP_BASE 0x040 +#define SVM_EXIT_INTR 0x060 +#define SVM_EXIT_NMI 0x061 +#define SVM_EXIT_SMI 0x062 +#define SVM_EXIT_INIT 0x063 +#define SVM_EXIT_VINTR 0x064 +#define SVM_EXIT_CR0_SEL_WRITE 0x065 +#define SVM_EXIT_IDTR_READ 0x066 +#define SVM_EXIT_GDTR_READ 0x067 +#define SVM_EXIT_LDTR_READ 0x068 +#define SVM_EXIT_TR_READ 0x069 +#define SVM_EXIT_IDTR_WRITE 0x06a +#define SVM_EXIT_GDTR_WRITE 0x06b +#define SVM_EXIT_LDTR_WRITE 0x06c +#define SVM_EXIT_TR_WRITE 0x06d +#define SVM_EXIT_RDTSC 0x06e +#define SVM_EXIT_RDPMC 0x06f +#define SVM_EXIT_PUSHF 0x070 +#define SVM_EXIT_POPF 0x071 +#define SVM_EXIT_CPUID 0x072 +#define SVM_EXIT_RSM 0x073 +#define SVM_EXIT_IRET 0x074 +#define SVM_EXIT_SWINT 0x075 +#define SVM_EXIT_INVD 0x076 +#define SVM_EXIT_PAUSE 0x077 +#define SVM_EXIT_HLT 0x078 +#define SVM_EXIT_INVLPG 0x079 +#define SVM_EXIT_INVLPGA 0x07a +#define SVM_EXIT_IOIO 0x07b +#define SVM_EXIT_MSR 0x07c +#define SVM_EXIT_TASK_SWITCH 0x07d +#define SVM_EXIT_FERR_FREEZE 0x07e +#define SVM_EXIT_SHUTDOWN 0x07f +#define SVM_EXIT_VMRUN 0x080 +#define SVM_EXIT_VMMCALL 0x081 +#define SVM_EXIT_VMLOAD 0x082 +#define SVM_EXIT_VMSAVE 0x083 +#define SVM_EXIT_STGI 0x084 +#define SVM_EXIT_CLGI 0x085 +#define SVM_EXIT_SKINIT 0x086 +#define SVM_EXIT_RDTSCP 0x087 +#define SVM_EXIT_ICEBP 0x088 +#define SVM_EXIT_WBINVD 0x089 +#define SVM_EXIT_NPF 0x400 + +#define SVM_EXIT_ERR -1 + +#define SVM_CR0_SELECTIVE_MASK (1 << 3 | 1) // TS and MP + +#define SVM_VMLOAD ".byte 0x0f, 0x01, 0xda" +#define SVM_VMRUN ".byte 0x0f, 0x01, 0xd8" +#define SVM_VMSAVE ".byte 0x0f, 0x01, 0xdb" +#define SVM_CLGI ".byte 0x0f, 0x01, 0xdd" +#define SVM_STGI ".byte 0x0f, 0x01, 0xdc" +#define SVM_INVLPGA ".byte 0x0f, 0x01, 0xdf" + +#endif + Index: linux/drivers/kvm/vmx.c =================================================================== --- /dev/null +++ linux/drivers/kvm/vmx.c @@ -0,0 +1,2008 @@ +/* + * Kernel-based Virtual Machine driver for Linux + * + * This module enables machines with Intel VT-x extensions to run virtual + * machines without emulation or binary translation. + * + * Copyright (C) 2006 Qumranet, Inc. + * + * Authors: + * Avi Kivity + * Yaniv Kamay + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include "kvm.h" +#include "vmx.h" +#include "kvm_vmx.h" +#include +#include +#include +#include + +#include "segment_descriptor.h" + +#define MSR_IA32_FEATURE_CONTROL 0x03a + +MODULE_AUTHOR("Qumranet"); +MODULE_LICENSE("GPL"); + +static DEFINE_PER_CPU(struct vmcs *, vmxarea); +static DEFINE_PER_CPU(struct vmcs *, current_vmcs); + +#ifdef __x86_64__ +#define HOST_IS_64 1 +#else +#define HOST_IS_64 0 +#endif + +static struct vmcs_descriptor { + int size; + int order; + u32 revision_id; +} vmcs_descriptor; + +#define VMX_SEGMENT_FIELD(seg) \ + [VCPU_SREG_##seg] { \ + GUEST_##seg##_SELECTOR, \ + GUEST_##seg##_BASE, \ + GUEST_##seg##_LIMIT, \ + GUEST_##seg##_AR_BYTES, \ + } + +static struct kvm_vmx_segment_field { + unsigned selector; + unsigned base; + unsigned limit; + unsigned ar_bytes; +} kvm_vmx_segment_fields[] = { + VMX_SEGMENT_FIELD(CS), + VMX_SEGMENT_FIELD(DS), + VMX_SEGMENT_FIELD(ES), + VMX_SEGMENT_FIELD(FS), + VMX_SEGMENT_FIELD(GS), + VMX_SEGMENT_FIELD(SS), + VMX_SEGMENT_FIELD(TR), + VMX_SEGMENT_FIELD(LDTR), +}; + +static const u32 vmx_msr_index[] = { +#ifdef __x86_64__ + MSR_SYSCALL_MASK, MSR_LSTAR, MSR_CSTAR, MSR_KERNEL_GS_BASE, +#endif + MSR_EFER, MSR_K6_STAR, +}; +#define NR_VMX_MSR (sizeof(vmx_msr_index) / sizeof(*vmx_msr_index)) + +static inline int is_page_fault(u32 intr_info) +{ + return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | + INTR_INFO_VALID_MASK)) == + (INTR_TYPE_EXCEPTION | PF_VECTOR | INTR_INFO_VALID_MASK); +} + +static inline int is_external_interrupt(u32 intr_info) +{ + return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) + == (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); +} + +static struct vmx_msr_entry *find_msr_entry(struct kvm_vcpu *vcpu, u32 msr) +{ + int i; + + for (i = 0; i < vcpu->nmsrs; ++i) + if (vcpu->guest_msrs[i].index == msr) + return &vcpu->guest_msrs[i]; + return 0; +} + +static void vmcs_clear(struct vmcs *vmcs) +{ + u64 phys_addr = __pa(vmcs); + u8 error; + + asm volatile (ASM_VMX_VMCLEAR_RAX "; setna %0" + : "=g"(error) : "a"(&phys_addr), "m"(phys_addr) + : "cc", "memory"); + if (error) + printk(KERN_ERR "kvm: vmclear fail: %p/%llx\n", + vmcs, phys_addr); +} + +static void __vcpu_clear(void *arg) +{ + struct kvm_vcpu *vcpu = arg; + int cpu = raw_smp_processor_id(); + + if (vcpu->cpu == cpu) + vmcs_clear(vcpu->vmcs); + if (per_cpu(current_vmcs, cpu) == vcpu->vmcs) + per_cpu(current_vmcs, cpu) = 0; +} + +static unsigned long vmcs_readl(unsigned long field) +{ + unsigned long value; + + asm volatile (ASM_VMX_VMREAD_RDX_RAX + : "=a"(value) : "d"(field) : "cc"); + return value; +} + +static u16 vmcs_read16(unsigned long field) +{ + return vmcs_readl(field); +} + +static u32 vmcs_read32(unsigned long field) +{ + return vmcs_readl(field); +} + +static u64 vmcs_read64(unsigned long field) +{ +#ifdef __x86_64__ + return vmcs_readl(field); +#else + return vmcs_readl(field) | ((u64)vmcs_readl(field+1) << 32); +#endif +} + +static void vmcs_writel(unsigned long field, unsigned long value) +{ + u8 error; + + asm volatile (ASM_VMX_VMWRITE_RAX_RDX "; setna %0" + : "=q"(error) : "a"(value), "d"(field) : "cc" ); + if (error) + printk(KERN_ERR "vmwrite error: reg %lx value %lx (err %d)\n", + field, value, vmcs_read32(VM_INSTRUCTION_ERROR)); +} + +static void vmcs_write16(unsigned long field, u16 value) +{ + vmcs_writel(field, value); +} + +static void vmcs_write32(unsigned long field, u32 value) +{ + vmcs_writel(field, value); +} + +static void vmcs_write64(unsigned long field, u64 value) +{ +#ifdef __x86_64__ + vmcs_writel(field, value); +#else + vmcs_writel(field, value); + asm volatile (""); + vmcs_writel(field+1, value >> 32); +#endif +} + +/* + * Switches to specified vcpu, until a matching vcpu_put(), but assumes + * vcpu mutex is already taken. + */ +static struct kvm_vcpu *vmx_vcpu_load(struct kvm_vcpu *vcpu) +{ + u64 phys_addr = __pa(vcpu->vmcs); + int cpu; + + cpu = get_cpu(); + + if (vcpu->cpu != cpu) { + smp_call_function(__vcpu_clear, vcpu, 0, 1); + vcpu->launched = 0; + } + + if (per_cpu(current_vmcs, cpu) != vcpu->vmcs) { + u8 error; + + per_cpu(current_vmcs, cpu) = vcpu->vmcs; + asm volatile (ASM_VMX_VMPTRLD_RAX "; setna %0" + : "=g"(error) : "a"(&phys_addr), "m"(phys_addr) + : "cc"); + if (error) + printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n", + vcpu->vmcs, phys_addr); + } + + if (vcpu->cpu != cpu) { + struct descriptor_table dt; + unsigned long sysenter_esp; + + vcpu->cpu = cpu; + /* + * Linux uses per-cpu TSS and GDT, so set these when switching + * processors. + */ + vmcs_writel(HOST_TR_BASE, read_tr_base()); /* 22.2.4 */ + get_gdt(&dt); + vmcs_writel(HOST_GDTR_BASE, dt.base); /* 22.2.4 */ + + rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp); + vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */ + } + return vcpu; +} + +static void vmx_vcpu_put(struct kvm_vcpu *vcpu) +{ + put_cpu(); +} + +static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu) +{ + return vmcs_readl(GUEST_RFLAGS); +} + +static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) +{ + vmcs_writel(GUEST_RFLAGS, rflags); +} + +static void skip_emulated_instruction(struct kvm_vcpu *vcpu) +{ + unsigned long rip; + u32 interruptibility; + + rip = vmcs_readl(GUEST_RIP); + rip += vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + vmcs_writel(GUEST_RIP, rip); + + /* + * We emulated an instruction, so temporary interrupt blocking + * should be removed, if set. + */ + interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO); + if (interruptibility & 3) + vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, + interruptibility & ~3); +} + +static void vmx_inject_gp(struct kvm_vcpu *vcpu, unsigned error_code) +{ + printk(KERN_DEBUG "inject_general_protection: rip 0x%lx\n", + vmcs_readl(GUEST_RIP)); + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code); + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + GP_VECTOR | + INTR_TYPE_EXCEPTION | + INTR_INFO_DELIEVER_CODE_MASK | + INTR_INFO_VALID_MASK); +} + +/* + * reads and returns guest's timestamp counter "register" + * guest_tsc = host_tsc + tsc_offset -- 21.3 + */ +static u64 guest_read_tsc(void) +{ + u64 host_tsc, tsc_offset; + + rdtscll(host_tsc); + tsc_offset = vmcs_read64(TSC_OFFSET); + return host_tsc + tsc_offset; +} + +/* + * writes 'guest_tsc' into guest's timestamp counter "register" + * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc + */ +static void guest_write_tsc(u64 guest_tsc) +{ + u64 host_tsc; + + rdtscll(host_tsc); + vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc); +} + +static void reload_tss(void) +{ +#ifndef __x86_64__ + + /* + * VT restores TR but not its size. Useless. + */ + struct descriptor_table gdt; + struct segment_descriptor *descs; + + get_gdt(&gdt); + descs = (void *)gdt.base; + descs[GDT_ENTRY_TSS].type = 9; /* available TSS */ + load_TR_desc(); +#endif +} + +/* + * Reads an msr value (of 'msr_index') into 'pdata'. + * Returns 0 on success, non-0 otherwise. + * Assumes vcpu_load() was already called. + */ +static int vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) +{ + u64 data; + struct vmx_msr_entry *msr; + + if (!pdata) { + printk(KERN_ERR "BUG: get_msr called with NULL pdata\n"); + return -EINVAL; + } + + switch (msr_index) { +#ifdef __x86_64__ + case MSR_FS_BASE: + data = vmcs_readl(GUEST_FS_BASE); + break; + case MSR_GS_BASE: + data = vmcs_readl(GUEST_GS_BASE); + break; + case MSR_EFER: + data = vcpu->shadow_efer; + break; +#endif + case MSR_IA32_TIME_STAMP_COUNTER: + data = guest_read_tsc(); + break; + case MSR_IA32_SYSENTER_CS: + data = vmcs_read32(GUEST_SYSENTER_CS); + break; + case MSR_IA32_SYSENTER_EIP: + data = vmcs_read32(GUEST_SYSENTER_EIP); + break; + case MSR_IA32_SYSENTER_ESP: + data = vmcs_read32(GUEST_SYSENTER_ESP); + break; + case MSR_IA32_MC0_CTL: + case MSR_IA32_MCG_STATUS: + case MSR_IA32_MCG_CAP: + case MSR_IA32_MC0_MISC: + case MSR_IA32_MC0_MISC+4: + case MSR_IA32_MC0_MISC+8: + case MSR_IA32_MC0_MISC+12: + case MSR_IA32_MC0_MISC+16: + case MSR_IA32_UCODE_REV: + /* MTRR registers */ + case 0xfe: + case 0x200 ... 0x2ff: + data = 0; + break; + case MSR_IA32_APICBASE: + data = vcpu->apic_base; + break; + default: + msr = find_msr_entry(vcpu, msr_index); + if (!msr) { + printk(KERN_ERR "kvm: unhandled rdmsr: %x\n", msr_index); + return 1; + } + data = msr->data; + break; + } + + *pdata = data; + return 0; +} + +/* + * Writes msr value into into the appropriate "register". + * Returns 0 on success, non-0 otherwise. + * Assumes vcpu_load() was already called. + */ +static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data) +{ + struct vmx_msr_entry *msr; + switch (msr_index) { +#ifdef __x86_64__ + case MSR_FS_BASE: + vmcs_writel(GUEST_FS_BASE, data); + break; + case MSR_GS_BASE: + vmcs_writel(GUEST_GS_BASE, data); + break; +#endif + case MSR_IA32_SYSENTER_CS: + vmcs_write32(GUEST_SYSENTER_CS, data); + break; + case MSR_IA32_SYSENTER_EIP: + vmcs_write32(GUEST_SYSENTER_EIP, data); + break; + case MSR_IA32_SYSENTER_ESP: + vmcs_write32(GUEST_SYSENTER_ESP, data); + break; +#ifdef __x86_64 + case MSR_EFER: + set_efer(vcpu, data); + break; + case MSR_IA32_MC0_STATUS: + printk(KERN_WARNING "%s: MSR_IA32_MC0_STATUS 0x%llx, nop\n" + , __FUNCTION__, data); + break; +#endif + case MSR_IA32_TIME_STAMP_COUNTER: { + guest_write_tsc(data); + break; + } + case MSR_IA32_UCODE_REV: + case MSR_IA32_UCODE_WRITE: + case 0x200 ... 0x2ff: /* MTRRs */ + break; + case MSR_IA32_APICBASE: + vcpu->apic_base = data; + break; + default: + msr = find_msr_entry(vcpu, msr_index); + if (!msr) { + printk(KERN_ERR "kvm: unhandled wrmsr: 0x%x\n", msr_index); + return 1; + } + msr->data = data; + break; + } + + return 0; +} + +/* + * Sync the rsp and rip registers into the vcpu structure. This allows + * registers to be accessed by indexing vcpu->regs. + */ +static void vcpu_load_rsp_rip(struct kvm_vcpu *vcpu) +{ + vcpu->regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP); + vcpu->rip = vmcs_readl(GUEST_RIP); +} + +/* + * Syncs rsp and rip back into the vmcs. Should be called after possible + * modification. + */ +static void vcpu_put_rsp_rip(struct kvm_vcpu *vcpu) +{ + vmcs_writel(GUEST_RSP, vcpu->regs[VCPU_REGS_RSP]); + vmcs_writel(GUEST_RIP, vcpu->rip); +} + +static int set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_debug_guest *dbg) +{ + unsigned long dr7 = 0x400; + u32 exception_bitmap; + int old_singlestep; + + exception_bitmap = vmcs_read32(EXCEPTION_BITMAP); + old_singlestep = vcpu->guest_debug.singlestep; + + vcpu->guest_debug.enabled = dbg->enabled; + if (vcpu->guest_debug.enabled) { + int i; + + dr7 |= 0x200; /* exact */ + for (i = 0; i < 4; ++i) { + if (!dbg->breakpoints[i].enabled) + continue; + vcpu->guest_debug.bp[i] = dbg->breakpoints[i].address; + dr7 |= 2 << (i*2); /* global enable */ + dr7 |= 0 << (i*4+16); /* execution breakpoint */ + } + + exception_bitmap |= (1u << 1); /* Trap debug exceptions */ + + vcpu->guest_debug.singlestep = dbg->singlestep; + } else { + exception_bitmap &= ~(1u << 1); /* Ignore debug exceptions */ + vcpu->guest_debug.singlestep = 0; + } + + if (old_singlestep && !vcpu->guest_debug.singlestep) { + unsigned long flags; + + flags = vmcs_readl(GUEST_RFLAGS); + flags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF); + vmcs_writel(GUEST_RFLAGS, flags); + } + + vmcs_write32(EXCEPTION_BITMAP, exception_bitmap); + vmcs_writel(GUEST_DR7, dr7); + + return 0; +} + +static __init int cpu_has_kvm_support(void) +{ + unsigned long ecx = cpuid_ecx(1); + return test_bit(5, &ecx); /* CPUID.1:ECX.VMX[bit 5] -> VT */ +} + +static __init int vmx_disabled_by_bios(void) +{ + u64 msr; + + rdmsrl(MSR_IA32_FEATURE_CONTROL, msr); + return (msr & 5) == 1; /* locked but not enabled */ +} + +static __init void hardware_enable(void *garbage) +{ + int cpu = raw_smp_processor_id(); + u64 phys_addr = __pa(per_cpu(vmxarea, cpu)); + u64 old; + + rdmsrl(MSR_IA32_FEATURE_CONTROL, old); + if ((old & 5) == 0) + /* enable and lock */ + wrmsrl(MSR_IA32_FEATURE_CONTROL, old | 5); + write_cr4(read_cr4() | CR4_VMXE); /* FIXME: not cpu hotplug safe */ + asm volatile (ASM_VMX_VMXON_RAX : : "a"(&phys_addr), "m"(phys_addr) + : "memory", "cc"); +} + +static void hardware_disable(void *garbage) +{ + asm volatile (ASM_VMX_VMXOFF : : : "cc"); +} + +static __init void setup_vmcs_descriptor(void) +{ + u32 vmx_msr_low, vmx_msr_high; + + rdmsr(MSR_IA32_VMX_BASIC_MSR, vmx_msr_low, vmx_msr_high); + vmcs_descriptor.size = vmx_msr_high & 0x1fff; + vmcs_descriptor.order = get_order(vmcs_descriptor.size); + vmcs_descriptor.revision_id = vmx_msr_low; +}; + +static struct vmcs *alloc_vmcs_cpu(int cpu) +{ + int node = cpu_to_node(cpu); + struct page *pages; + struct vmcs *vmcs; + + pages = alloc_pages_node(node, GFP_KERNEL, vmcs_descriptor.order); + if (!pages) + return 0; + vmcs = page_address(pages); + memset(vmcs, 0, vmcs_descriptor.size); + vmcs->revision_id = vmcs_descriptor.revision_id; /* vmcs revision id */ + return vmcs; +} + +static struct vmcs *alloc_vmcs(void) +{ + return alloc_vmcs_cpu(raw_smp_processor_id()); +} + +static void free_vmcs(struct vmcs *vmcs) +{ + free_pages((unsigned long)vmcs, vmcs_descriptor.order); +} + +static __exit void free_kvm_area(void) +{ + int cpu; + + for_each_online_cpu(cpu) + free_vmcs(per_cpu(vmxarea, cpu)); +} + +extern struct vmcs *alloc_vmcs_cpu(int cpu); + +static __init int alloc_kvm_area(void) +{ + int cpu; + + for_each_online_cpu(cpu) { + struct vmcs *vmcs; + + vmcs = alloc_vmcs_cpu(cpu); + if (!vmcs) { + free_kvm_area(); + return -ENOMEM; + } + + per_cpu(vmxarea, cpu) = vmcs; + } + return 0; +} + +static __init int hardware_setup(void) +{ + setup_vmcs_descriptor(); + return alloc_kvm_area(); +} + +static __exit void hardware_unsetup(void) +{ + free_kvm_area(); +} + +static void update_exception_bitmap(struct kvm_vcpu *vcpu) +{ + if (vcpu->rmode.active) + vmcs_write32(EXCEPTION_BITMAP, ~0); + else + vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR); +} + +static void fix_pmode_dataseg(int seg, struct kvm_save_segment *save) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + + if (vmcs_readl(sf->base) == save->base) { + vmcs_write16(sf->selector, save->selector); + vmcs_writel(sf->base, save->base); + vmcs_write32(sf->limit, save->limit); + vmcs_write32(sf->ar_bytes, save->ar); + } else { + u32 dpl = (vmcs_read16(sf->selector) & SELECTOR_RPL_MASK) + << AR_DPL_SHIFT; + vmcs_write32(sf->ar_bytes, 0x93 | dpl); + } +} + +static void enter_pmode(struct kvm_vcpu *vcpu) +{ + unsigned long flags; + + vcpu->rmode.active = 0; + + vmcs_writel(GUEST_TR_BASE, vcpu->rmode.tr.base); + vmcs_write32(GUEST_TR_LIMIT, vcpu->rmode.tr.limit); + vmcs_write32(GUEST_TR_AR_BYTES, vcpu->rmode.tr.ar); + + flags = vmcs_readl(GUEST_RFLAGS); + flags &= ~(IOPL_MASK | X86_EFLAGS_VM); + flags |= (vcpu->rmode.save_iopl << IOPL_SHIFT); + vmcs_writel(GUEST_RFLAGS, flags); + + vmcs_writel(GUEST_CR4, (vmcs_readl(GUEST_CR4) & ~CR4_VME_MASK) | + (vmcs_readl(CR4_READ_SHADOW) & CR4_VME_MASK)); + + update_exception_bitmap(vcpu); + + fix_pmode_dataseg(VCPU_SREG_ES, &vcpu->rmode.es); + fix_pmode_dataseg(VCPU_SREG_DS, &vcpu->rmode.ds); + fix_pmode_dataseg(VCPU_SREG_GS, &vcpu->rmode.gs); + fix_pmode_dataseg(VCPU_SREG_FS, &vcpu->rmode.fs); + + vmcs_write16(GUEST_SS_SELECTOR, 0); + vmcs_write32(GUEST_SS_AR_BYTES, 0x93); + + vmcs_write16(GUEST_CS_SELECTOR, + vmcs_read16(GUEST_CS_SELECTOR) & ~SELECTOR_RPL_MASK); + vmcs_write32(GUEST_CS_AR_BYTES, 0x9b); +} + +static int rmode_tss_base(struct kvm* kvm) +{ + gfn_t base_gfn = kvm->memslots[0].base_gfn + kvm->memslots[0].npages - 3; + return base_gfn << PAGE_SHIFT; +} + +static void fix_rmode_seg(int seg, struct kvm_save_segment *save) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + + save->selector = vmcs_read16(sf->selector); + save->base = vmcs_readl(sf->base); + save->limit = vmcs_read32(sf->limit); + save->ar = vmcs_read32(sf->ar_bytes); + vmcs_write16(sf->selector, vmcs_readl(sf->base) >> 4); + vmcs_write32(sf->limit, 0xffff); + vmcs_write32(sf->ar_bytes, 0xf3); +} + +static void enter_rmode(struct kvm_vcpu *vcpu) +{ + unsigned long flags; + + vcpu->rmode.active = 1; + + vcpu->rmode.tr.base = vmcs_readl(GUEST_TR_BASE); + vmcs_writel(GUEST_TR_BASE, rmode_tss_base(vcpu->kvm)); + + vcpu->rmode.tr.limit = vmcs_read32(GUEST_TR_LIMIT); + vmcs_write32(GUEST_TR_LIMIT, RMODE_TSS_SIZE - 1); + + vcpu->rmode.tr.ar = vmcs_read32(GUEST_TR_AR_BYTES); + vmcs_write32(GUEST_TR_AR_BYTES, 0x008b); + + flags = vmcs_readl(GUEST_RFLAGS); + vcpu->rmode.save_iopl = (flags & IOPL_MASK) >> IOPL_SHIFT; + + flags |= IOPL_MASK | X86_EFLAGS_VM; + + vmcs_writel(GUEST_RFLAGS, flags); + vmcs_writel(GUEST_CR4, vmcs_readl(GUEST_CR4) | CR4_VME_MASK); + update_exception_bitmap(vcpu); + + vmcs_write16(GUEST_SS_SELECTOR, vmcs_readl(GUEST_SS_BASE) >> 4); + vmcs_write32(GUEST_SS_LIMIT, 0xffff); + vmcs_write32(GUEST_SS_AR_BYTES, 0xf3); + + vmcs_write32(GUEST_CS_AR_BYTES, 0xf3); + vmcs_write16(GUEST_CS_SELECTOR, vmcs_readl(GUEST_CS_BASE) >> 4); + + fix_rmode_seg(VCPU_SREG_ES, &vcpu->rmode.es); + fix_rmode_seg(VCPU_SREG_DS, &vcpu->rmode.ds); + fix_rmode_seg(VCPU_SREG_GS, &vcpu->rmode.gs); + fix_rmode_seg(VCPU_SREG_FS, &vcpu->rmode.fs); +} + +#ifdef __x86_64__ + +static void enter_lmode(struct kvm_vcpu *vcpu) +{ + u32 guest_tr_ar; + + guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES); + if ((guest_tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_64_TSS) { + printk(KERN_DEBUG "%s: tss fixup for long mode. \n", + __FUNCTION__); + vmcs_write32(GUEST_TR_AR_BYTES, + (guest_tr_ar & ~AR_TYPE_MASK) + | AR_TYPE_BUSY_64_TSS); + } + + vcpu->shadow_efer |= EFER_LMA; + + find_msr_entry(vcpu, MSR_EFER)->data |= EFER_LMA | EFER_LME; + vmcs_write32(VM_ENTRY_CONTROLS, + vmcs_read32(VM_ENTRY_CONTROLS) + | VM_ENTRY_CONTROLS_IA32E_MASK); +} + +static void exit_lmode(struct kvm_vcpu *vcpu) +{ + vcpu->shadow_efer &= ~EFER_LMA; + + vmcs_write32(VM_ENTRY_CONTROLS, + vmcs_read32(VM_ENTRY_CONTROLS) + & ~VM_ENTRY_CONTROLS_IA32E_MASK); +} + +#endif + +static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) +{ + if (vcpu->rmode.active && (cr0 & CR0_PE_MASK)) + enter_pmode(vcpu); + + if (!vcpu->rmode.active && !(cr0 & CR0_PE_MASK)) + enter_rmode(vcpu); + +#ifdef __x86_64__ + if (vcpu->shadow_efer & EFER_LME) { + if (!is_paging(vcpu) && (cr0 & CR0_PG_MASK)) + enter_lmode(vcpu); + if (is_paging(vcpu) && !(cr0 & CR0_PG_MASK)) + exit_lmode(vcpu); + } +#endif + + vmcs_writel(CR0_READ_SHADOW, cr0); + vmcs_writel(GUEST_CR0, + (cr0 & ~KVM_GUEST_CR0_MASK) | KVM_VM_CR0_ALWAYS_ON); + vcpu->cr0 = cr0; +} + +/* + * Used when restoring the VM to avoid corrupting segment registers + */ +static void vmx_set_cr0_no_modeswitch(struct kvm_vcpu *vcpu, unsigned long cr0) +{ + vcpu->rmode.active = ((cr0 & CR0_PE_MASK) == 0); + update_exception_bitmap(vcpu); + vmcs_writel(CR0_READ_SHADOW, cr0); + vmcs_writel(GUEST_CR0, + (cr0 & ~KVM_GUEST_CR0_MASK) | KVM_VM_CR0_ALWAYS_ON); + vcpu->cr0 = cr0; +} + +static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) +{ + vmcs_writel(GUEST_CR3, cr3); +} + +static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +{ + vmcs_writel(CR4_READ_SHADOW, cr4); + vmcs_writel(GUEST_CR4, cr4 | (vcpu->rmode.active ? + KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON)); + vcpu->cr4 = cr4; +} + +#ifdef __x86_64__ + +static void vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer) +{ + struct vmx_msr_entry *msr = find_msr_entry(vcpu, MSR_EFER); + + vcpu->shadow_efer = efer; + if (efer & EFER_LMA) { + vmcs_write32(VM_ENTRY_CONTROLS, + vmcs_read32(VM_ENTRY_CONTROLS) | + VM_ENTRY_CONTROLS_IA32E_MASK); + msr->data = efer; + + } else { + vmcs_write32(VM_ENTRY_CONTROLS, + vmcs_read32(VM_ENTRY_CONTROLS) & + ~VM_ENTRY_CONTROLS_IA32E_MASK); + + msr->data = efer & ~EFER_LME; + } +} + +#endif + +static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + + return vmcs_readl(sf->base); +} + +static void vmx_get_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + u32 ar; + + var->base = vmcs_readl(sf->base); + var->limit = vmcs_read32(sf->limit); + var->selector = vmcs_read16(sf->selector); + ar = vmcs_read32(sf->ar_bytes); + if (ar & AR_UNUSABLE_MASK) + ar = 0; + var->type = ar & 15; + var->s = (ar >> 4) & 1; + var->dpl = (ar >> 5) & 3; + var->present = (ar >> 7) & 1; + var->avl = (ar >> 12) & 1; + var->l = (ar >> 13) & 1; + var->db = (ar >> 14) & 1; + var->g = (ar >> 15) & 1; + var->unusable = (ar >> 16) & 1; +} + +static void vmx_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + u32 ar; + + vmcs_writel(sf->base, var->base); + vmcs_write32(sf->limit, var->limit); + vmcs_write16(sf->selector, var->selector); + if (var->unusable) + ar = 1 << 16; + else { + ar = var->type & 15; + ar |= (var->s & 1) << 4; + ar |= (var->dpl & 3) << 5; + ar |= (var->present & 1) << 7; + ar |= (var->avl & 1) << 12; + ar |= (var->l & 1) << 13; + ar |= (var->db & 1) << 14; + ar |= (var->g & 1) << 15; + } + if (ar == 0) /* a 0 value means unusable */ + ar = AR_UNUSABLE_MASK; + vmcs_write32(sf->ar_bytes, ar); +} + +static int vmx_is_long_mode(struct kvm_vcpu *vcpu) +{ + return vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_CONTROLS_IA32E_MASK; +} + +static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) +{ + u32 ar = vmcs_read32(GUEST_CS_AR_BYTES); + + *db = (ar >> 14) & 1; + *l = (ar >> 13) & 1; +} + +static void vmx_get_idt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + dt->limit = vmcs_read32(GUEST_IDTR_LIMIT); + dt->base = vmcs_readl(GUEST_IDTR_BASE); +} + +static void vmx_set_idt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + vmcs_write32(GUEST_IDTR_LIMIT, dt->limit); + vmcs_writel(GUEST_IDTR_BASE, dt->base); +} + +static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + dt->limit = vmcs_read32(GUEST_GDTR_LIMIT); + dt->base = vmcs_readl(GUEST_GDTR_BASE); +} + +static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct descriptor_table *dt) +{ + vmcs_write32(GUEST_GDTR_LIMIT, dt->limit); + vmcs_writel(GUEST_GDTR_BASE, dt->base); +} + +static int init_rmode_tss(struct kvm* kvm) +{ + struct page *p1, *p2, *p3; + gfn_t fn = rmode_tss_base(kvm) >> PAGE_SHIFT; + char *page; + + p1 = _gfn_to_page(kvm, fn++); + p2 = _gfn_to_page(kvm, fn++); + p3 = _gfn_to_page(kvm, fn); + + if (!p1 || !p2 || !p3) { + kvm_printf(kvm,"%s: gfn_to_page failed\n", __FUNCTION__); + return 0; + } + + page = kmap_atomic(p1, KM_USER0); + memset(page, 0, PAGE_SIZE); + *(u16*)(page + 0x66) = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE; + kunmap_atomic(page, KM_USER0); + + page = kmap_atomic(p2, KM_USER0); + memset(page, 0, PAGE_SIZE); + kunmap_atomic(page, KM_USER0); + + page = kmap_atomic(p3, KM_USER0); + memset(page, 0, PAGE_SIZE); + *(page + RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1) = ~0; + kunmap_atomic(page, KM_USER0); + + return 1; +} + +static void vmcs_write32_fixedbits(u32 msr, u32 vmcs_field, u32 val) +{ + u32 msr_high, msr_low; + + rdmsr(msr, msr_low, msr_high); + + val &= msr_high; + val |= msr_low; + vmcs_write32(vmcs_field, val); +} + +static void seg_setup(int seg) +{ + struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; + + vmcs_write16(sf->selector, 0); + vmcs_writel(sf->base, 0); + vmcs_write32(sf->limit, 0xffff); + vmcs_write32(sf->ar_bytes, 0x93); +} + +/* + * Sets up the vmcs for emulated real mode. + */ +static int vmx_vcpu_setup(struct kvm_vcpu *vcpu) +{ + u32 host_sysenter_cs; + u32 junk; + unsigned long a; + struct descriptor_table dt; + int i; + int ret = 0; + int nr_good_msrs; + extern asmlinkage void kvm_vmx_return(void); + + if (!init_rmode_tss(vcpu->kvm)) { + ret = -ENOMEM; + goto out; + } + + memset(vcpu->regs, 0, sizeof(vcpu->regs)); + vcpu->regs[VCPU_REGS_RDX] = get_rdx_init_val(); + vcpu->cr8 = 0; + vcpu->apic_base = 0xfee00000 | + /*for vcpu 0*/ MSR_IA32_APICBASE_BSP | + MSR_IA32_APICBASE_ENABLE; + + fx_init(vcpu); + + /* + * GUEST_CS_BASE should really be 0xffff0000, but VT vm86 mode + * insists on having GUEST_CS_BASE == GUEST_CS_SELECTOR << 4. Sigh. + */ + vmcs_write16(GUEST_CS_SELECTOR, 0xf000); + vmcs_writel(GUEST_CS_BASE, 0x000f0000); + vmcs_write32(GUEST_CS_LIMIT, 0xffff); + vmcs_write32(GUEST_CS_AR_BYTES, 0x9b); + + seg_setup(VCPU_SREG_DS); + seg_setup(VCPU_SREG_ES); + seg_setup(VCPU_SREG_FS); + seg_setup(VCPU_SREG_GS); + seg_setup(VCPU_SREG_SS); + + vmcs_write16(GUEST_TR_SELECTOR, 0); + vmcs_writel(GUEST_TR_BASE, 0); + vmcs_write32(GUEST_TR_LIMIT, 0xffff); + vmcs_write32(GUEST_TR_AR_BYTES, 0x008b); + + vmcs_write16(GUEST_LDTR_SELECTOR, 0); + vmcs_writel(GUEST_LDTR_BASE, 0); + vmcs_write32(GUEST_LDTR_LIMIT, 0xffff); + vmcs_write32(GUEST_LDTR_AR_BYTES, 0x00082); + + vmcs_write32(GUEST_SYSENTER_CS, 0); + vmcs_writel(GUEST_SYSENTER_ESP, 0); + vmcs_writel(GUEST_SYSENTER_EIP, 0); + + vmcs_writel(GUEST_RFLAGS, 0x02); + vmcs_writel(GUEST_RIP, 0xfff0); + vmcs_writel(GUEST_RSP, 0); + + vmcs_writel(GUEST_CR3, 0); + + //todo: dr0 = dr1 = dr2 = dr3 = 0; dr6 = 0xffff0ff0 + vmcs_writel(GUEST_DR7, 0x400); + + vmcs_writel(GUEST_GDTR_BASE, 0); + vmcs_write32(GUEST_GDTR_LIMIT, 0xffff); + + vmcs_writel(GUEST_IDTR_BASE, 0); + vmcs_write32(GUEST_IDTR_LIMIT, 0xffff); + + vmcs_write32(GUEST_ACTIVITY_STATE, 0); + vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0); + vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0); + + /* I/O */ + vmcs_write64(IO_BITMAP_A, 0); + vmcs_write64(IO_BITMAP_B, 0); + + guest_write_tsc(0); + + vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */ + + /* Special registers */ + vmcs_write64(GUEST_IA32_DEBUGCTL, 0); + + /* Control */ + vmcs_write32_fixedbits(MSR_IA32_VMX_PINBASED_CTLS_MSR, + PIN_BASED_VM_EXEC_CONTROL, + PIN_BASED_EXT_INTR_MASK /* 20.6.1 */ + | PIN_BASED_NMI_EXITING /* 20.6.1 */ + ); + vmcs_write32_fixedbits(MSR_IA32_VMX_PROCBASED_CTLS_MSR, + CPU_BASED_VM_EXEC_CONTROL, + CPU_BASED_HLT_EXITING /* 20.6.2 */ + | CPU_BASED_CR8_LOAD_EXITING /* 20.6.2 */ + | CPU_BASED_CR8_STORE_EXITING /* 20.6.2 */ + | CPU_BASED_UNCOND_IO_EXITING /* 20.6.2 */ + | CPU_BASED_INVDPG_EXITING + | CPU_BASED_MOV_DR_EXITING + | CPU_BASED_USE_TSC_OFFSETING /* 21.3 */ + ); + + vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0); + vmcs_write32(CR3_TARGET_COUNT, 0); /* 22.2.1 */ + + vmcs_writel(HOST_CR0, read_cr0()); /* 22.2.3 */ + vmcs_writel(HOST_CR4, read_cr4()); /* 22.2.3, 22.2.5 */ + vmcs_writel(HOST_CR3, read_cr3()); /* 22.2.3 FIXME: shadow tables */ + + vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS); /* 22.2.4 */ + vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS); /* 22.2.4 */ + vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS); /* 22.2.4 */ + vmcs_write16(HOST_FS_SELECTOR, read_fs()); /* 22.2.4 */ + vmcs_write16(HOST_GS_SELECTOR, read_gs()); /* 22.2.4 */ + vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */ +#ifdef __x86_64__ + rdmsrl(MSR_FS_BASE, a); + vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */ + rdmsrl(MSR_GS_BASE, a); + vmcs_writel(HOST_GS_BASE, a); /* 22.2.4 */ +#else + vmcs_writel(HOST_FS_BASE, 0); /* 22.2.4 */ + vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */ +#endif + + vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */ + + get_idt(&dt); + vmcs_writel(HOST_IDTR_BASE, dt.base); /* 22.2.4 */ + + + vmcs_writel(HOST_RIP, (unsigned long)kvm_vmx_return); /* 22.2.5 */ + + rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk); + vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs); + rdmsrl(MSR_IA32_SYSENTER_ESP, a); + vmcs_writel(HOST_IA32_SYSENTER_ESP, a); /* 22.2.3 */ + rdmsrl(MSR_IA32_SYSENTER_EIP, a); + vmcs_writel(HOST_IA32_SYSENTER_EIP, a); /* 22.2.3 */ + + ret = -ENOMEM; + vcpu->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (!vcpu->guest_msrs) + goto out; + vcpu->host_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (!vcpu->host_msrs) + goto out_free_guest_msrs; + + for (i = 0; i < NR_VMX_MSR; ++i) { + u32 index = vmx_msr_index[i]; + u32 data_low, data_high; + u64 data; + int j = vcpu->nmsrs; + + if (rdmsr_safe(index, &data_low, &data_high) < 0) + continue; + data = data_low | ((u64)data_high << 32); + vcpu->host_msrs[j].index = index; + vcpu->host_msrs[j].reserved = 0; + vcpu->host_msrs[j].data = data; + vcpu->guest_msrs[j] = vcpu->host_msrs[j]; + ++vcpu->nmsrs; + } + printk("msrs: %d\n", vcpu->nmsrs); + + nr_good_msrs = vcpu->nmsrs - NR_BAD_MSRS; + vmcs_writel(VM_ENTRY_MSR_LOAD_ADDR, + virt_to_phys(vcpu->guest_msrs + NR_BAD_MSRS)); + vmcs_writel(VM_EXIT_MSR_STORE_ADDR, + virt_to_phys(vcpu->guest_msrs + NR_BAD_MSRS)); + vmcs_writel(VM_EXIT_MSR_LOAD_ADDR, + virt_to_phys(vcpu->host_msrs + NR_BAD_MSRS)); + vmcs_write32_fixedbits(MSR_IA32_VMX_EXIT_CTLS_MSR, VM_EXIT_CONTROLS, + (HOST_IS_64 << 9)); /* 22.2,1, 20.7.1 */ + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, nr_good_msrs); /* 22.2.2 */ + vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, nr_good_msrs); /* 22.2.2 */ + vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, nr_good_msrs); /* 22.2.2 */ + + + /* 22.2.1, 20.8.1 */ + vmcs_write32_fixedbits(MSR_IA32_VMX_ENTRY_CTLS_MSR, + VM_ENTRY_CONTROLS, 0); + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */ + + vmcs_writel(VIRTUAL_APIC_PAGE_ADDR, 0); + vmcs_writel(TPR_THRESHOLD, 0); + + vmcs_writel(CR0_GUEST_HOST_MASK, KVM_GUEST_CR0_MASK); + vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK); + + vcpu->cr0 = 0x60000010; + vmx_set_cr0(vcpu, vcpu->cr0); // enter rmode + vmx_set_cr4(vcpu, 0); +#ifdef __x86_64__ + vmx_set_efer(vcpu, 0); +#endif + + return 0; + +out_free_guest_msrs: + kfree(vcpu->guest_msrs); +out: + return ret; +} + +static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq) +{ + u16 ent[2]; + u16 cs; + u16 ip; + unsigned long flags; + unsigned long ss_base = vmcs_readl(GUEST_SS_BASE); + u16 sp = vmcs_readl(GUEST_RSP); + u32 ss_limit = vmcs_read32(GUEST_SS_LIMIT); + + if (sp > ss_limit || sp - 6 > sp) { + vcpu_printf(vcpu, "%s: #SS, rsp 0x%lx ss 0x%lx limit 0x%x\n", + __FUNCTION__, + vmcs_readl(GUEST_RSP), + vmcs_readl(GUEST_SS_BASE), + vmcs_read32(GUEST_SS_LIMIT)); + return; + } + + if (kvm_read_guest(vcpu, irq * sizeof(ent), sizeof(ent), &ent) != + sizeof(ent)) { + vcpu_printf(vcpu, "%s: read guest err\n", __FUNCTION__); + return; + } + + flags = vmcs_readl(GUEST_RFLAGS); + cs = vmcs_readl(GUEST_CS_BASE) >> 4; + ip = vmcs_readl(GUEST_RIP); + + + if (kvm_write_guest(vcpu, ss_base + sp - 2, 2, &flags) != 2 || + kvm_write_guest(vcpu, ss_base + sp - 4, 2, &cs) != 2 || + kvm_write_guest(vcpu, ss_base + sp - 6, 2, &ip) != 2) { + vcpu_printf(vcpu, "%s: write guest err\n", __FUNCTION__); + return; + } + + vmcs_writel(GUEST_RFLAGS, flags & + ~( X86_EFLAGS_IF | X86_EFLAGS_AC | X86_EFLAGS_TF)); + vmcs_write16(GUEST_CS_SELECTOR, ent[1]) ; + vmcs_writel(GUEST_CS_BASE, ent[1] << 4); + vmcs_writel(GUEST_RIP, ent[0]); + vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6)); +} + +static void kvm_do_inject_irq(struct kvm_vcpu *vcpu) +{ + int word_index = __ffs(vcpu->irq_summary); + int bit_index = __ffs(vcpu->irq_pending[word_index]); + int irq = word_index * BITS_PER_LONG + bit_index; + + clear_bit(bit_index, &vcpu->irq_pending[word_index]); + if (!vcpu->irq_pending[word_index]) + clear_bit(word_index, &vcpu->irq_summary); + + if (vcpu->rmode.active) { + inject_rmode_irq(vcpu, irq); + return; + } + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK); +} + +static void kvm_try_inject_irq(struct kvm_vcpu *vcpu) +{ + if ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) + && (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0) + /* + * Interrupts enabled, and not blocked by sti or mov ss. Good. + */ + kvm_do_inject_irq(vcpu); + else + /* + * Interrupts blocked. Wait for unblock. + */ + vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, + vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) + | CPU_BASED_VIRTUAL_INTR_PENDING); +} + +static void kvm_guest_debug_pre(struct kvm_vcpu *vcpu) +{ + struct kvm_guest_debug *dbg = &vcpu->guest_debug; + + set_debugreg(dbg->bp[0], 0); + set_debugreg(dbg->bp[1], 1); + set_debugreg(dbg->bp[2], 2); + set_debugreg(dbg->bp[3], 3); + + if (dbg->singlestep) { + unsigned long flags; + + flags = vmcs_readl(GUEST_RFLAGS); + flags |= X86_EFLAGS_TF | X86_EFLAGS_RF; + vmcs_writel(GUEST_RFLAGS, flags); + } +} + +static int handle_rmode_exception(struct kvm_vcpu *vcpu, + int vec, u32 err_code) +{ + if (!vcpu->rmode.active) + return 0; + + if (vec == GP_VECTOR && err_code == 0) + if (emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE) + return 1; + return 0; +} + +static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 intr_info, error_code; + unsigned long cr2, rip; + u32 vect_info; + enum emulation_result er; + + vect_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); + intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + + if ((vect_info & VECTORING_INFO_VALID_MASK) && + !is_page_fault(intr_info)) { + printk(KERN_ERR "%s: unexpected, vectoring info 0x%x " + "intr info 0x%x\n", __FUNCTION__, vect_info, intr_info); + } + + if (is_external_interrupt(vect_info)) { + int irq = vect_info & VECTORING_INFO_VECTOR_MASK; + set_bit(irq, vcpu->irq_pending); + set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary); + } + + if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */ + asm ("int $2"); + return 1; + } + error_code = 0; + rip = vmcs_readl(GUEST_RIP); + if (intr_info & INTR_INFO_DELIEVER_CODE_MASK) + error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); + if (is_page_fault(intr_info)) { + cr2 = vmcs_readl(EXIT_QUALIFICATION); + + spin_lock(&vcpu->kvm->lock); + if (!vcpu->mmu.page_fault(vcpu, cr2, error_code)) { + spin_unlock(&vcpu->kvm->lock); + return 1; + } + + er = emulate_instruction(vcpu, kvm_run, cr2, error_code); + spin_unlock(&vcpu->kvm->lock); + + switch (er) { + case EMULATE_DONE: + return 1; + case EMULATE_DO_MMIO: + ++kvm_stat.mmio_exits; + kvm_run->exit_reason = KVM_EXIT_MMIO; + return 0; + case EMULATE_FAIL: + vcpu_printf(vcpu, "%s: emulate fail\n", __FUNCTION__); + break; + default: + BUG(); + } + } + + if (vcpu->rmode.active && + handle_rmode_exception(vcpu, intr_info & INTR_INFO_VECTOR_MASK, + error_code)) + return 1; + + if ((intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK)) == (INTR_TYPE_EXCEPTION | 1)) { + kvm_run->exit_reason = KVM_EXIT_DEBUG; + return 0; + } + kvm_run->exit_reason = KVM_EXIT_EXCEPTION; + kvm_run->ex.exception = intr_info & INTR_INFO_VECTOR_MASK; + kvm_run->ex.error_code = error_code; + return 0; +} + +static int handle_external_interrupt(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) +{ + ++kvm_stat.irq_exits; + return 1; +} + + +static int get_io_count(struct kvm_vcpu *vcpu, u64 *count) +{ + u64 inst; + gva_t rip; + int countr_size; + int i, n; + + if ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_VM)) { + countr_size = 2; + } else { + u32 cs_ar = vmcs_read32(GUEST_CS_AR_BYTES); + + countr_size = (cs_ar & AR_L_MASK) ? 8: + (cs_ar & AR_DB_MASK) ? 4: 2; + } + + rip = vmcs_readl(GUEST_RIP); + if (countr_size != 8) + rip += vmcs_readl(GUEST_CS_BASE); + + n = kvm_read_guest(vcpu, rip, sizeof(inst), &inst); + + for (i = 0; i < n; i++) { + switch (((u8*)&inst)[i]) { + case 0xf0: + case 0xf2: + case 0xf3: + case 0x2e: + case 0x36: + case 0x3e: + case 0x26: + case 0x64: + case 0x65: + case 0x66: + break; + case 0x67: + countr_size = (countr_size == 2) ? 4: (countr_size >> 1); + default: + goto done; + } + } + return 0; +done: + countr_size *= 8; + *count = vcpu->regs[VCPU_REGS_RCX] & (~0ULL >> (64 - countr_size)); + return 1; +} + +static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u64 exit_qualification; + + ++kvm_stat.io_exits; + exit_qualification = vmcs_read64(EXIT_QUALIFICATION); + kvm_run->exit_reason = KVM_EXIT_IO; + if (exit_qualification & 8) + kvm_run->io.direction = KVM_EXIT_IO_IN; + else + kvm_run->io.direction = KVM_EXIT_IO_OUT; + kvm_run->io.size = (exit_qualification & 7) + 1; + kvm_run->io.string = (exit_qualification & 16) != 0; + kvm_run->io.string_down + = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_DF) != 0; + kvm_run->io.rep = (exit_qualification & 32) != 0; + kvm_run->io.port = exit_qualification >> 16; + if (kvm_run->io.string) { + if (!get_io_count(vcpu, &kvm_run->io.count)) + return 1; + kvm_run->io.address = vmcs_readl(GUEST_LINEAR_ADDRESS); + } else + kvm_run->io.value = vcpu->regs[VCPU_REGS_RAX]; /* rax */ + return 0; +} + +static int handle_invlpg(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u64 address = vmcs_read64(EXIT_QUALIFICATION); + int instruction_length = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + spin_lock(&vcpu->kvm->lock); + vcpu->mmu.inval_page(vcpu, address); + spin_unlock(&vcpu->kvm->lock); + vmcs_writel(GUEST_RIP, vmcs_readl(GUEST_RIP) + instruction_length); + return 1; +} + +static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u64 exit_qualification; + int cr; + int reg; + + exit_qualification = vmcs_read64(EXIT_QUALIFICATION); + cr = exit_qualification & 15; + reg = (exit_qualification >> 8) & 15; + switch ((exit_qualification >> 4) & 3) { + case 0: /* mov to cr */ + switch (cr) { + case 0: + vcpu_load_rsp_rip(vcpu); + set_cr0(vcpu, vcpu->regs[reg]); + skip_emulated_instruction(vcpu); + return 1; + case 3: + vcpu_load_rsp_rip(vcpu); + set_cr3(vcpu, vcpu->regs[reg]); + skip_emulated_instruction(vcpu); + return 1; + case 4: + vcpu_load_rsp_rip(vcpu); + set_cr4(vcpu, vcpu->regs[reg]); + skip_emulated_instruction(vcpu); + return 1; + case 8: + vcpu_load_rsp_rip(vcpu); + set_cr8(vcpu, vcpu->regs[reg]); + skip_emulated_instruction(vcpu); + return 1; + }; + break; + case 1: /*mov from cr*/ + switch (cr) { + case 3: + vcpu_load_rsp_rip(vcpu); + vcpu->regs[reg] = vcpu->cr3; + vcpu_put_rsp_rip(vcpu); + skip_emulated_instruction(vcpu); + return 1; + case 8: + printk(KERN_DEBUG "handle_cr: read CR8 " + "cpu erratum AA15\n"); + vcpu_load_rsp_rip(vcpu); + vcpu->regs[reg] = vcpu->cr8; + vcpu_put_rsp_rip(vcpu); + skip_emulated_instruction(vcpu); + return 1; + } + break; + case 3: /* lmsw */ + lmsw(vcpu, (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f); + + skip_emulated_instruction(vcpu); + return 1; + default: + break; + } + kvm_run->exit_reason = 0; + printk(KERN_ERR "kvm: unhandled control register: op %d cr %d\n", + (int)(exit_qualification >> 4) & 3, cr); + return 0; +} + +static int handle_dr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u64 exit_qualification; + unsigned long val; + int dr, reg; + + /* + * FIXME: this code assumes the host is debugging the guest. + * need to deal with guest debugging itself too. + */ + exit_qualification = vmcs_read64(EXIT_QUALIFICATION); + dr = exit_qualification & 7; + reg = (exit_qualification >> 8) & 15; + vcpu_load_rsp_rip(vcpu); + if (exit_qualification & 16) { + /* mov from dr */ + switch (dr) { + case 6: + val = 0xffff0ff0; + break; + case 7: + val = 0x400; + break; + default: + val = 0; + } + vcpu->regs[reg] = val; + } else { + /* mov to dr */ + } + vcpu_put_rsp_rip(vcpu); + skip_emulated_instruction(vcpu); + return 1; +} + +static int handle_cpuid(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + kvm_run->exit_reason = KVM_EXIT_CPUID; + return 0; +} + +static int handle_rdmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 ecx = vcpu->regs[VCPU_REGS_RCX]; + u64 data; + + if (vmx_get_msr(vcpu, ecx, &data)) { + vmx_inject_gp(vcpu, 0); + return 1; + } + + /* FIXME: handling of bits 32:63 of rax, rdx */ + vcpu->regs[VCPU_REGS_RAX] = data & -1u; + vcpu->regs[VCPU_REGS_RDX] = (data >> 32) & -1u; + skip_emulated_instruction(vcpu); + return 1; +} + +static int handle_wrmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u32 ecx = vcpu->regs[VCPU_REGS_RCX]; + u64 data = (vcpu->regs[VCPU_REGS_RAX] & -1u) + | ((u64)(vcpu->regs[VCPU_REGS_RDX] & -1u) << 32); + + if (vmx_set_msr(vcpu, ecx, data) != 0) { + vmx_inject_gp(vcpu, 0); + return 1; + } + + skip_emulated_instruction(vcpu); + return 1; +} + +static int handle_interrupt_window(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) +{ + /* Turn off interrupt window reporting. */ + vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, + vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) + & ~CPU_BASED_VIRTUAL_INTR_PENDING); + return 1; +} + +static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + skip_emulated_instruction(vcpu); + if (vcpu->irq_summary && (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)) + return 1; + + kvm_run->exit_reason = KVM_EXIT_HLT; + return 0; +} + +/* + * The exit handlers return 1 if the exit was handled fully and guest execution + * may resume. Otherwise they set the kvm_run parameter to indicate what needs + * to be done to userspace and return 0. + */ +static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu, + struct kvm_run *kvm_run) = { + [EXIT_REASON_EXCEPTION_NMI] = handle_exception, + [EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt, + [EXIT_REASON_IO_INSTRUCTION] = handle_io, + [EXIT_REASON_INVLPG] = handle_invlpg, + [EXIT_REASON_CR_ACCESS] = handle_cr, + [EXIT_REASON_DR_ACCESS] = handle_dr, + [EXIT_REASON_CPUID] = handle_cpuid, + [EXIT_REASON_MSR_READ] = handle_rdmsr, + [EXIT_REASON_MSR_WRITE] = handle_wrmsr, + [EXIT_REASON_PENDING_INTERRUPT] = handle_interrupt_window, + [EXIT_REASON_HLT] = handle_halt, +}; + +static const int kvm_vmx_max_exit_handlers = + sizeof(kvm_vmx_exit_handlers) / sizeof(*kvm_vmx_exit_handlers); + +/* + * The guest has exited. See if we can fix it or if we need userspace + * assistance. + */ +static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) +{ + u32 vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); + u32 exit_reason = vmcs_read32(VM_EXIT_REASON); + + if ( (vectoring_info & VECTORING_INFO_VALID_MASK) && + exit_reason != EXIT_REASON_EXCEPTION_NMI ) + printk(KERN_WARNING "%s: unexpected, valid vectoring info and " + "exit reason is 0x%x\n", __FUNCTION__, exit_reason); + kvm_run->instruction_length = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + if (exit_reason < kvm_vmx_max_exit_handlers + && kvm_vmx_exit_handlers[exit_reason]) + return kvm_vmx_exit_handlers[exit_reason](vcpu, kvm_run); + else { + kvm_run->exit_reason = KVM_EXIT_UNKNOWN; + kvm_run->hw.hardware_exit_reason = exit_reason; + } + return 0; +} + +static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) +{ + u8 fail; + u16 fs_sel, gs_sel, ldt_sel; + int fs_gs_ldt_reload_needed; + +again: + /* + * Set host fs and gs selectors. Unfortunately, 22.2.3 does not + * allow segment selectors with cpl > 0 or ti == 1. + */ + fs_sel = read_fs(); + gs_sel = read_gs(); + ldt_sel = read_ldt(); + fs_gs_ldt_reload_needed = (fs_sel & 7) | (gs_sel & 7) | ldt_sel; + if (!fs_gs_ldt_reload_needed) { + vmcs_write16(HOST_FS_SELECTOR, fs_sel); + vmcs_write16(HOST_GS_SELECTOR, gs_sel); + } else { + vmcs_write16(HOST_FS_SELECTOR, 0); + vmcs_write16(HOST_GS_SELECTOR, 0); + } + +#ifdef __x86_64__ + vmcs_writel(HOST_FS_BASE, read_msr(MSR_FS_BASE)); + vmcs_writel(HOST_GS_BASE, read_msr(MSR_GS_BASE)); +#endif + + if (vcpu->irq_summary && + !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK)) + kvm_try_inject_irq(vcpu); + + if (vcpu->guest_debug.enabled) + kvm_guest_debug_pre(vcpu); + + fx_save(vcpu->host_fx_image); + fx_restore(vcpu->guest_fx_image); + + save_msrs(vcpu->host_msrs, vcpu->nmsrs); + load_msrs(vcpu->guest_msrs, NR_BAD_MSRS); + + asm ( + /* Store host registers */ + "pushf \n\t" +#ifdef __x86_64__ + "push %%rax; push %%rbx; push %%rdx;" + "push %%rsi; push %%rdi; push %%rbp;" + "push %%r8; push %%r9; push %%r10; push %%r11;" + "push %%r12; push %%r13; push %%r14; push %%r15;" + "push %%rcx \n\t" + ASM_VMX_VMWRITE_RSP_RDX "\n\t" +#else + "pusha; push %%ecx \n\t" + ASM_VMX_VMWRITE_RSP_RDX "\n\t" +#endif + /* Check if vmlaunch of vmresume is needed */ + "cmp $0, %1 \n\t" + /* Load guest registers. Don't clobber flags. */ +#ifdef __x86_64__ + "mov %c[cr2](%3), %%rax \n\t" + "mov %%rax, %%cr2 \n\t" + "mov %c[rax](%3), %%rax \n\t" + "mov %c[rbx](%3), %%rbx \n\t" + "mov %c[rdx](%3), %%rdx \n\t" + "mov %c[rsi](%3), %%rsi \n\t" + "mov %c[rdi](%3), %%rdi \n\t" + "mov %c[rbp](%3), %%rbp \n\t" + "mov %c[r8](%3), %%r8 \n\t" + "mov %c[r9](%3), %%r9 \n\t" + "mov %c[r10](%3), %%r10 \n\t" + "mov %c[r11](%3), %%r11 \n\t" + "mov %c[r12](%3), %%r12 \n\t" + "mov %c[r13](%3), %%r13 \n\t" + "mov %c[r14](%3), %%r14 \n\t" + "mov %c[r15](%3), %%r15 \n\t" + "mov %c[rcx](%3), %%rcx \n\t" /* kills %3 (rcx) */ +#else + "mov %c[cr2](%3), %%eax \n\t" + "mov %%eax, %%cr2 \n\t" + "mov %c[rax](%3), %%eax \n\t" + "mov %c[rbx](%3), %%ebx \n\t" + "mov %c[rdx](%3), %%edx \n\t" + "mov %c[rsi](%3), %%esi \n\t" + "mov %c[rdi](%3), %%edi \n\t" + "mov %c[rbp](%3), %%ebp \n\t" + "mov %c[rcx](%3), %%ecx \n\t" /* kills %3 (ecx) */ +#endif + /* Enter guest mode */ + "jne launched \n\t" + ASM_VMX_VMLAUNCH "\n\t" + "jmp kvm_vmx_return \n\t" + "launched: " ASM_VMX_VMRESUME "\n\t" + ".globl kvm_vmx_return \n\t" + "kvm_vmx_return: " + /* Save guest registers, load host registers, keep flags */ +#ifdef __x86_64__ + "xchg %3, 0(%%rsp) \n\t" + "mov %%rax, %c[rax](%3) \n\t" + "mov %%rbx, %c[rbx](%3) \n\t" + "pushq 0(%%rsp); popq %c[rcx](%3) \n\t" + "mov %%rdx, %c[rdx](%3) \n\t" + "mov %%rsi, %c[rsi](%3) \n\t" + "mov %%rdi, %c[rdi](%3) \n\t" + "mov %%rbp, %c[rbp](%3) \n\t" + "mov %%r8, %c[r8](%3) \n\t" + "mov %%r9, %c[r9](%3) \n\t" + "mov %%r10, %c[r10](%3) \n\t" + "mov %%r11, %c[r11](%3) \n\t" + "mov %%r12, %c[r12](%3) \n\t" + "mov %%r13, %c[r13](%3) \n\t" + "mov %%r14, %c[r14](%3) \n\t" + "mov %%r15, %c[r15](%3) \n\t" + "mov %%cr2, %%rax \n\t" + "mov %%rax, %c[cr2](%3) \n\t" + "mov 0(%%rsp), %3 \n\t" + + "pop %%rcx; pop %%r15; pop %%r14; pop %%r13; pop %%r12;" + "pop %%r11; pop %%r10; pop %%r9; pop %%r8;" + "pop %%rbp; pop %%rdi; pop %%rsi;" + "pop %%rdx; pop %%rbx; pop %%rax \n\t" +#else + "xchg %3, 0(%%esp) \n\t" + "mov %%eax, %c[rax](%3) \n\t" + "mov %%ebx, %c[rbx](%3) \n\t" + "pushl 0(%%esp); popl %c[rcx](%3) \n\t" + "mov %%edx, %c[rdx](%3) \n\t" + "mov %%esi, %c[rsi](%3) \n\t" + "mov %%edi, %c[rdi](%3) \n\t" + "mov %%ebp, %c[rbp](%3) \n\t" + "mov %%cr2, %%eax \n\t" + "mov %%eax, %c[cr2](%3) \n\t" + "mov 0(%%esp), %3 \n\t" + + "pop %%ecx; popa \n\t" +#endif + "setbe %0 \n\t" + "popf \n\t" + : "=g" (fail) + : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP), + "c"(vcpu), + [rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])), + [rbx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBX])), + [rcx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RCX])), + [rdx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDX])), + [rsi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RSI])), + [rdi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDI])), + [rbp]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBP])), +#ifdef __x86_64__ + [r8 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R8 ])), + [r9 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R9 ])), + [r10]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R10])), + [r11]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R11])), + [r12]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R12])), + [r13]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R13])), + [r14]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R14])), + [r15]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R15])), +#endif + [cr2]"i"(offsetof(struct kvm_vcpu, cr2)) + : "cc", "memory" ); + + ++kvm_stat.exits; + + save_msrs(vcpu->guest_msrs, NR_BAD_MSRS); + load_msrs(vcpu->host_msrs, NR_BAD_MSRS); + + fx_save(vcpu->guest_fx_image); + fx_restore(vcpu->host_fx_image); + +#ifndef __x86_64__ + asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS)); +#endif + + kvm_run->exit_type = 0; + if (fail) { + kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY; + kvm_run->exit_reason = vmcs_read32(VM_INSTRUCTION_ERROR); + } else { + if (fs_gs_ldt_reload_needed) { + load_ldt(ldt_sel); + load_fs(fs_sel); + /* + * If we have to reload gs, we must take care to + * preserve our gs base. + */ + local_irq_disable(); + load_gs(gs_sel); +#ifdef __x86_64__ + wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE)); +#endif + local_irq_enable(); + + reload_tss(); + } + vcpu->launched = 1; + kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT; + if (kvm_handle_exit(kvm_run, vcpu)) { + /* Give scheduler a change to reschedule. */ + if (signal_pending(current)) { + ++kvm_stat.signal_exits; + return -EINTR; + } + kvm_resched(vcpu); + goto again; + } + } + return 0; +} + +static void vmx_flush_tlb(struct kvm_vcpu *vcpu) +{ + vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3)); +} + +static void vmx_inject_page_fault(struct kvm_vcpu *vcpu, + unsigned long addr, + u32 err_code) +{ + u32 vect_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); + + ++kvm_stat.pf_guest; + + if (is_page_fault(vect_info)) { + printk(KERN_DEBUG "inject_page_fault: " + "double fault 0x%lx @ 0x%lx\n", + addr, vmcs_readl(GUEST_RIP)); + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, 0); + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + DF_VECTOR | + INTR_TYPE_EXCEPTION | + INTR_INFO_DELIEVER_CODE_MASK | + INTR_INFO_VALID_MASK); + return; + } + vcpu->cr2 = addr; + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, err_code); + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + PF_VECTOR | + INTR_TYPE_EXCEPTION | + INTR_INFO_DELIEVER_CODE_MASK | + INTR_INFO_VALID_MASK); + +} + +static void vmx_free_vmcs(struct kvm_vcpu *vcpu) +{ + if (vcpu->vmcs) { + on_each_cpu(__vcpu_clear, vcpu, 0, 1); + free_vmcs(vcpu->vmcs); + vcpu->vmcs = 0; + } +} + +static void vmx_free_vcpu(struct kvm_vcpu *vcpu) +{ + vmx_free_vmcs(vcpu); +} + +static int vmx_create_vcpu(struct kvm_vcpu *vcpu) +{ + struct vmcs *vmcs; + + vmcs = alloc_vmcs(); + if (!vmcs) + return -ENOMEM; + vmcs_clear(vmcs); + vcpu->vmcs = vmcs; + vcpu->launched = 0; + return 0; +} + +static struct kvm_arch_ops vmx_arch_ops = { + .cpu_has_kvm_support = cpu_has_kvm_support, + .disabled_by_bios = vmx_disabled_by_bios, + .hardware_setup = hardware_setup, + .hardware_unsetup = hardware_unsetup, + .hardware_enable = hardware_enable, + .hardware_disable = hardware_disable, + + .vcpu_create = vmx_create_vcpu, + .vcpu_free = vmx_free_vcpu, + + .vcpu_load = vmx_vcpu_load, + .vcpu_put = vmx_vcpu_put, + + .set_guest_debug = set_guest_debug, + .get_msr = vmx_get_msr, + .set_msr = vmx_set_msr, + .get_segment_base = vmx_get_segment_base, + .get_segment = vmx_get_segment, + .set_segment = vmx_set_segment, + .is_long_mode = vmx_is_long_mode, + .get_cs_db_l_bits = vmx_get_cs_db_l_bits, + .set_cr0 = vmx_set_cr0, + .set_cr0_no_modeswitch = vmx_set_cr0_no_modeswitch, + .set_cr3 = vmx_set_cr3, + .set_cr4 = vmx_set_cr4, +#ifdef __x86_64__ + .set_efer = vmx_set_efer, +#endif + .get_idt = vmx_get_idt, + .set_idt = vmx_set_idt, + .get_gdt = vmx_get_gdt, + .set_gdt = vmx_set_gdt, + .cache_regs = vcpu_load_rsp_rip, + .decache_regs = vcpu_put_rsp_rip, + .get_rflags = vmx_get_rflags, + .set_rflags = vmx_set_rflags, + + .tlb_flush = vmx_flush_tlb, + .inject_page_fault = vmx_inject_page_fault, + + .inject_gp = vmx_inject_gp, + + .run = vmx_vcpu_run, + .skip_emulated_instruction = skip_emulated_instruction, + .vcpu_setup = vmx_vcpu_setup, +}; + +static int __init vmx_init(void) +{ + return kvm_init_arch(&vmx_arch_ops, THIS_MODULE); +} + +static void __exit vmx_exit(void) +{ + kvm_exit_arch(); +} + +module_init(vmx_init) +module_exit(vmx_exit) Index: linux/drivers/kvm/vmx.h =================================================================== --- /dev/null +++ linux/drivers/kvm/vmx.h @@ -0,0 +1,296 @@ +#ifndef VMX_H +#define VMX_H + +/* + * vmx.h: VMX Architecture related definitions + * Copyright (c) 2004, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + * + * A few random additions are: + * Copyright (C) 2006 Qumranet + * Avi Kivity + * Yaniv Kamay + * + */ + +#define CPU_BASED_VIRTUAL_INTR_PENDING 0x00000004 +#define CPU_BASED_USE_TSC_OFFSETING 0x00000008 +#define CPU_BASED_HLT_EXITING 0x00000080 +#define CPU_BASED_INVDPG_EXITING 0x00000200 +#define CPU_BASED_MWAIT_EXITING 0x00000400 +#define CPU_BASED_RDPMC_EXITING 0x00000800 +#define CPU_BASED_RDTSC_EXITING 0x00001000 +#define CPU_BASED_CR8_LOAD_EXITING 0x00080000 +#define CPU_BASED_CR8_STORE_EXITING 0x00100000 +#define CPU_BASED_TPR_SHADOW 0x00200000 +#define CPU_BASED_MOV_DR_EXITING 0x00800000 +#define CPU_BASED_UNCOND_IO_EXITING 0x01000000 +#define CPU_BASED_ACTIVATE_IO_BITMAP 0x02000000 +#define CPU_BASED_MSR_BITMAPS 0x10000000 +#define CPU_BASED_MONITOR_EXITING 0x20000000 +#define CPU_BASED_PAUSE_EXITING 0x40000000 + +#define PIN_BASED_EXT_INTR_MASK 0x1 +#define PIN_BASED_NMI_EXITING 0x8 + +#define VM_EXIT_ACK_INTR_ON_EXIT 0x00008000 +#define VM_EXIT_HOST_ADD_SPACE_SIZE 0x00000200 + + +/* VMCS Encodings */ +enum vmcs_field { + GUEST_ES_SELECTOR = 0x00000800, + GUEST_CS_SELECTOR = 0x00000802, + GUEST_SS_SELECTOR = 0x00000804, + GUEST_DS_SELECTOR = 0x00000806, + GUEST_FS_SELECTOR = 0x00000808, + GUEST_GS_SELECTOR = 0x0000080a, + GUEST_LDTR_SELECTOR = 0x0000080c, + GUEST_TR_SELECTOR = 0x0000080e, + HOST_ES_SELECTOR = 0x00000c00, + HOST_CS_SELECTOR = 0x00000c02, + HOST_SS_SELECTOR = 0x00000c04, + HOST_DS_SELECTOR = 0x00000c06, + HOST_FS_SELECTOR = 0x00000c08, + HOST_GS_SELECTOR = 0x00000c0a, + HOST_TR_SELECTOR = 0x00000c0c, + IO_BITMAP_A = 0x00002000, + IO_BITMAP_A_HIGH = 0x00002001, + IO_BITMAP_B = 0x00002002, + IO_BITMAP_B_HIGH = 0x00002003, + MSR_BITMAP = 0x00002004, + MSR_BITMAP_HIGH = 0x00002005, + VM_EXIT_MSR_STORE_ADDR = 0x00002006, + VM_EXIT_MSR_STORE_ADDR_HIGH = 0x00002007, + VM_EXIT_MSR_LOAD_ADDR = 0x00002008, + VM_EXIT_MSR_LOAD_ADDR_HIGH = 0x00002009, + VM_ENTRY_MSR_LOAD_ADDR = 0x0000200a, + VM_ENTRY_MSR_LOAD_ADDR_HIGH = 0x0000200b, + TSC_OFFSET = 0x00002010, + TSC_OFFSET_HIGH = 0x00002011, + VIRTUAL_APIC_PAGE_ADDR = 0x00002012, + VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x00002013, + VMCS_LINK_POINTER = 0x00002800, + VMCS_LINK_POINTER_HIGH = 0x00002801, + GUEST_IA32_DEBUGCTL = 0x00002802, + GUEST_IA32_DEBUGCTL_HIGH = 0x00002803, + PIN_BASED_VM_EXEC_CONTROL = 0x00004000, + CPU_BASED_VM_EXEC_CONTROL = 0x00004002, + EXCEPTION_BITMAP = 0x00004004, + PAGE_FAULT_ERROR_CODE_MASK = 0x00004006, + PAGE_FAULT_ERROR_CODE_MATCH = 0x00004008, + CR3_TARGET_COUNT = 0x0000400a, + VM_EXIT_CONTROLS = 0x0000400c, + VM_EXIT_MSR_STORE_COUNT = 0x0000400e, + VM_EXIT_MSR_LOAD_COUNT = 0x00004010, + VM_ENTRY_CONTROLS = 0x00004012, + VM_ENTRY_MSR_LOAD_COUNT = 0x00004014, + VM_ENTRY_INTR_INFO_FIELD = 0x00004016, + VM_ENTRY_EXCEPTION_ERROR_CODE = 0x00004018, + VM_ENTRY_INSTRUCTION_LEN = 0x0000401a, + TPR_THRESHOLD = 0x0000401c, + SECONDARY_VM_EXEC_CONTROL = 0x0000401e, + VM_INSTRUCTION_ERROR = 0x00004400, + VM_EXIT_REASON = 0x00004402, + VM_EXIT_INTR_INFO = 0x00004404, + VM_EXIT_INTR_ERROR_CODE = 0x00004406, + IDT_VECTORING_INFO_FIELD = 0x00004408, + IDT_VECTORING_ERROR_CODE = 0x0000440a, + VM_EXIT_INSTRUCTION_LEN = 0x0000440c, + VMX_INSTRUCTION_INFO = 0x0000440e, + GUEST_ES_LIMIT = 0x00004800, + GUEST_CS_LIMIT = 0x00004802, + GUEST_SS_LIMIT = 0x00004804, + GUEST_DS_LIMIT = 0x00004806, + GUEST_FS_LIMIT = 0x00004808, + GUEST_GS_LIMIT = 0x0000480a, + GUEST_LDTR_LIMIT = 0x0000480c, + GUEST_TR_LIMIT = 0x0000480e, + GUEST_GDTR_LIMIT = 0x00004810, + GUEST_IDTR_LIMIT = 0x00004812, + GUEST_ES_AR_BYTES = 0x00004814, + GUEST_CS_AR_BYTES = 0x00004816, + GUEST_SS_AR_BYTES = 0x00004818, + GUEST_DS_AR_BYTES = 0x0000481a, + GUEST_FS_AR_BYTES = 0x0000481c, + GUEST_GS_AR_BYTES = 0x0000481e, + GUEST_LDTR_AR_BYTES = 0x00004820, + GUEST_TR_AR_BYTES = 0x00004822, + GUEST_INTERRUPTIBILITY_INFO = 0x00004824, + GUEST_ACTIVITY_STATE = 0X00004826, + GUEST_SYSENTER_CS = 0x0000482A, + HOST_IA32_SYSENTER_CS = 0x00004c00, + CR0_GUEST_HOST_MASK = 0x00006000, + CR4_GUEST_HOST_MASK = 0x00006002, + CR0_READ_SHADOW = 0x00006004, + CR4_READ_SHADOW = 0x00006006, + CR3_TARGET_VALUE0 = 0x00006008, + CR3_TARGET_VALUE1 = 0x0000600a, + CR3_TARGET_VALUE2 = 0x0000600c, + CR3_TARGET_VALUE3 = 0x0000600e, + EXIT_QUALIFICATION = 0x00006400, + GUEST_LINEAR_ADDRESS = 0x0000640a, + GUEST_CR0 = 0x00006800, + GUEST_CR3 = 0x00006802, + GUEST_CR4 = 0x00006804, + GUEST_ES_BASE = 0x00006806, + GUEST_CS_BASE = 0x00006808, + GUEST_SS_BASE = 0x0000680a, + GUEST_DS_BASE = 0x0000680c, + GUEST_FS_BASE = 0x0000680e, + GUEST_GS_BASE = 0x00006810, + GUEST_LDTR_BASE = 0x00006812, + GUEST_TR_BASE = 0x00006814, + GUEST_GDTR_BASE = 0x00006816, + GUEST_IDTR_BASE = 0x00006818, + GUEST_DR7 = 0x0000681a, + GUEST_RSP = 0x0000681c, + GUEST_RIP = 0x0000681e, + GUEST_RFLAGS = 0x00006820, + GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822, + GUEST_SYSENTER_ESP = 0x00006824, + GUEST_SYSENTER_EIP = 0x00006826, + HOST_CR0 = 0x00006c00, + HOST_CR3 = 0x00006c02, + HOST_CR4 = 0x00006c04, + HOST_FS_BASE = 0x00006c06, + HOST_GS_BASE = 0x00006c08, + HOST_TR_BASE = 0x00006c0a, + HOST_GDTR_BASE = 0x00006c0c, + HOST_IDTR_BASE = 0x00006c0e, + HOST_IA32_SYSENTER_ESP = 0x00006c10, + HOST_IA32_SYSENTER_EIP = 0x00006c12, + HOST_RSP = 0x00006c14, + HOST_RIP = 0x00006c16, +}; + +#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000 + +#define EXIT_REASON_EXCEPTION_NMI 0 +#define EXIT_REASON_EXTERNAL_INTERRUPT 1 + +#define EXIT_REASON_PENDING_INTERRUPT 7 + +#define EXIT_REASON_TASK_SWITCH 9 +#define EXIT_REASON_CPUID 10 +#define EXIT_REASON_HLT 12 +#define EXIT_REASON_INVLPG 14 +#define EXIT_REASON_RDPMC 15 +#define EXIT_REASON_RDTSC 16 +#define EXIT_REASON_VMCALL 18 +#define EXIT_REASON_VMCLEAR 19 +#define EXIT_REASON_VMLAUNCH 20 +#define EXIT_REASON_VMPTRLD 21 +#define EXIT_REASON_VMPTRST 22 +#define EXIT_REASON_VMREAD 23 +#define EXIT_REASON_VMRESUME 24 +#define EXIT_REASON_VMWRITE 25 +#define EXIT_REASON_VMOFF 26 +#define EXIT_REASON_VMON 27 +#define EXIT_REASON_CR_ACCESS 28 +#define EXIT_REASON_DR_ACCESS 29 +#define EXIT_REASON_IO_INSTRUCTION 30 +#define EXIT_REASON_MSR_READ 31 +#define EXIT_REASON_MSR_WRITE 32 +#define EXIT_REASON_MWAIT_INSTRUCTION 36 + +/* + * Interruption-information format + */ +#define INTR_INFO_VECTOR_MASK 0xff /* 7:0 */ +#define INTR_INFO_INTR_TYPE_MASK 0x700 /* 10:8 */ +#define INTR_INFO_DELIEVER_CODE_MASK 0x800 /* 11 */ +#define INTR_INFO_VALID_MASK 0x80000000 /* 31 */ + +#define VECTORING_INFO_VECTOR_MASK INTR_INFO_VECTOR_MASK +#define VECTORING_INFO_TYPE_MASK INTR_INFO_INTR_TYPE_MASK +#define VECTORING_INFO_DELIEVER_CODE_MASK INTR_INFO_DELIEVER_CODE_MASK +#define VECTORING_INFO_VALID_MASK INTR_INFO_VALID_MASK + +#define INTR_TYPE_EXT_INTR (0 << 8) /* external interrupt */ +#define INTR_TYPE_EXCEPTION (3 << 8) /* processor exception */ + +/* + * Exit Qualifications for MOV for Control Register Access + */ +#define CONTROL_REG_ACCESS_NUM 0x7 /* 2:0, number of control register */ +#define CONTROL_REG_ACCESS_TYPE 0x30 /* 5:4, access type */ +#define CONTROL_REG_ACCESS_REG 0xf00 /* 10:8, general purpose register */ +#define LMSW_SOURCE_DATA_SHIFT 16 +#define LMSW_SOURCE_DATA (0xFFFF << LMSW_SOURCE_DATA_SHIFT) /* 16:31 lmsw source */ +#define REG_EAX (0 << 8) +#define REG_ECX (1 << 8) +#define REG_EDX (2 << 8) +#define REG_EBX (3 << 8) +#define REG_ESP (4 << 8) +#define REG_EBP (5 << 8) +#define REG_ESI (6 << 8) +#define REG_EDI (7 << 8) +#define REG_R8 (8 << 8) +#define REG_R9 (9 << 8) +#define REG_R10 (10 << 8) +#define REG_R11 (11 << 8) +#define REG_R12 (12 << 8) +#define REG_R13 (13 << 8) +#define REG_R14 (14 << 8) +#define REG_R15 (15 << 8) + +/* + * Exit Qualifications for MOV for Debug Register Access + */ +#define DEBUG_REG_ACCESS_NUM 0x7 /* 2:0, number of debug register */ +#define DEBUG_REG_ACCESS_TYPE 0x10 /* 4, direction of access */ +#define TYPE_MOV_TO_DR (0 << 4) +#define TYPE_MOV_FROM_DR (1 << 4) +#define DEBUG_REG_ACCESS_REG 0xf00 /* 11:8, general purpose register */ + + +/* segment AR */ +#define SEGMENT_AR_L_MASK (1 << 13) + +/* entry controls */ +#define VM_ENTRY_CONTROLS_IA32E_MASK (1 << 9) + +#define AR_TYPE_ACCESSES_MASK 1 +#define AR_TYPE_READABLE_MASK (1 << 1) +#define AR_TYPE_WRITEABLE_MASK (1 << 2) +#define AR_TYPE_CODE_MASK (1 << 3) +#define AR_TYPE_MASK 0x0f +#define AR_TYPE_BUSY_64_TSS 11 +#define AR_TYPE_BUSY_32_TSS 11 +#define AR_TYPE_BUSY_16_TSS 3 +#define AR_TYPE_LDT 2 + +#define AR_UNUSABLE_MASK (1 << 16) +#define AR_S_MASK (1 << 4) +#define AR_P_MASK (1 << 7) +#define AR_L_MASK (1 << 13) +#define AR_DB_MASK (1 << 14) +#define AR_G_MASK (1 << 15) +#define AR_DPL_SHIFT 5 +#define AR_DPL(ar) (((ar) >> AR_DPL_SHIFT) & 3) + +#define AR_RESERVD_MASK 0xfffe0f00 + +#define CR4_VMXE 0x2000 + +#define MSR_IA32_VMX_BASIC_MSR 0x480 +#define MSR_IA32_FEATURE_CONTROL 0x03a +#define MSR_IA32_VMX_PINBASED_CTLS_MSR 0x481 +#define MSR_IA32_VMX_PROCBASED_CTLS_MSR 0x482 +#define MSR_IA32_VMX_EXIT_CTLS_MSR 0x483 +#define MSR_IA32_VMX_ENTRY_CTLS_MSR 0x484 + +#endif Index: linux/drivers/kvm/x86_emulate.c =================================================================== --- /dev/null +++ linux/drivers/kvm/x86_emulate.c @@ -0,0 +1,1403 @@ +/****************************************************************************** + * x86_emulate.c + * + * Generic x86 (32-bit and 64-bit) instruction decoder and emulator. + * + * Copyright (c) 2005 Keir Fraser + * + * Linux coding style, mod r/m decoder, segment base fixes, real-mode + * privieged instructions: + * + * Copyright (C) 2006 Qumranet + * + * Avi Kivity + * Yaniv Kamay + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4 + */ + +#ifndef __KERNEL__ +#include +#include +#include +#define DPRINTF(_f, _a ...) printf( _f , ## _a ) +#else +#include "kvm.h" +#define DPRINTF(x...) do {} while (0) +#endif +#include "x86_emulate.h" +#include + +/* + * Opcode effective-address decode tables. + * Note that we only emulate instructions that have at least one memory + * operand (excluding implicit stack references). We assume that stack + * references and instruction fetches will never occur in special memory + * areas that require emulation. So, for example, 'mov ,' need + * not be handled. + */ + +/* Operand sizes: 8-bit operands or specified/overridden size. */ +#define ByteOp (1<<0) /* 8-bit operands. */ +/* Destination operand type. */ +#define ImplicitOps (1<<1) /* Implicit in opcode. No generic decode. */ +#define DstReg (2<<1) /* Register operand. */ +#define DstMem (3<<1) /* Memory operand. */ +#define DstMask (3<<1) +/* Source operand type. */ +#define SrcNone (0<<3) /* No source operand. */ +#define SrcImplicit (0<<3) /* Source operand is implicit in the opcode. */ +#define SrcReg (1<<3) /* Register operand. */ +#define SrcMem (2<<3) /* Memory operand. */ +#define SrcMem16 (3<<3) /* Memory operand (16-bit). */ +#define SrcMem32 (4<<3) /* Memory operand (32-bit). */ +#define SrcImm (5<<3) /* Immediate operand. */ +#define SrcImmByte (6<<3) /* 8-bit sign-extended immediate operand. */ +#define SrcMask (7<<3) +/* Generic ModRM decode. */ +#define ModRM (1<<6) +/* Destination is only written; never read. */ +#define Mov (1<<7) + +static u8 opcode_table[256] = { + /* 0x00 - 0x07 */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x08 - 0x0F */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x10 - 0x17 */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x18 - 0x1F */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x20 - 0x27 */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x28 - 0x2F */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x30 - 0x37 */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x38 - 0x3F */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM, + 0, 0, 0, 0, + /* 0x40 - 0x4F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x50 - 0x5F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x60 - 0x6F */ + 0, 0, 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ , + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x70 - 0x7F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x80 - 0x87 */ + ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImm | ModRM, + ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM, + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, + /* 0x88 - 0x8F */ + ByteOp | DstMem | SrcReg | ModRM | Mov, DstMem | SrcReg | ModRM | Mov, + ByteOp | DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + 0, 0, 0, DstMem | SrcNone | ModRM | Mov, + /* 0x90 - 0x9F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xA0 - 0xA7 */ + ByteOp | DstReg | SrcMem | Mov, DstReg | SrcMem | Mov, + ByteOp | DstMem | SrcReg | Mov, DstMem | SrcReg | Mov, + ByteOp | ImplicitOps | Mov, ImplicitOps | Mov, + ByteOp | ImplicitOps, ImplicitOps, + /* 0xA8 - 0xAF */ + 0, 0, ByteOp | ImplicitOps | Mov, ImplicitOps | Mov, + ByteOp | ImplicitOps | Mov, ImplicitOps | Mov, + ByteOp | ImplicitOps, ImplicitOps, + /* 0xB0 - 0xBF */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xC0 - 0xC7 */ + ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM, 0, 0, + 0, 0, ByteOp | DstMem | SrcImm | ModRM | Mov, + DstMem | SrcImm | ModRM | Mov, + /* 0xC8 - 0xCF */ + 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xD0 - 0xD7 */ + ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM, + ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM, + 0, 0, 0, 0, + /* 0xD8 - 0xDF */ + 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xE0 - 0xEF */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xF0 - 0xF7 */ + 0, 0, 0, 0, + 0, 0, ByteOp | DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM, + /* 0xF8 - 0xFF */ + 0, 0, 0, 0, + 0, 0, ByteOp | DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM +}; + +static u8 twobyte_table[256] = { + /* 0x00 - 0x0F */ + 0, SrcMem | ModRM | DstReg, 0, 0, 0, 0, ImplicitOps, 0, + 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, + /* 0x10 - 0x1F */ + 0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0, + /* 0x20 - 0x2F */ + ImplicitOps, ModRM, ImplicitOps, ModRM, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x30 - 0x3F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x40 - 0x47 */ + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + /* 0x48 - 0x4F */ + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, + /* 0x50 - 0x5F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x60 - 0x6F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x70 - 0x7F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x80 - 0x8F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0x90 - 0x9F */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xA0 - 0xA7 */ + 0, 0, 0, DstMem | SrcReg | ModRM, 0, 0, 0, 0, + /* 0xA8 - 0xAF */ + 0, 0, 0, DstMem | SrcReg | ModRM, 0, 0, 0, 0, + /* 0xB0 - 0xB7 */ + ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, 0, + DstMem | SrcReg | ModRM, + 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem16 | ModRM | Mov, + /* 0xB8 - 0xBF */ + 0, 0, DstMem | SrcImmByte | ModRM, DstMem | SrcReg | ModRM, + 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov, + DstReg | SrcMem16 | ModRM | Mov, + /* 0xC0 - 0xCF */ + 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xD0 - 0xDF */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xE0 - 0xEF */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + /* 0xF0 - 0xFF */ + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 +}; + +/* + * Tell the emulator that of the Group 7 instructions (sgdt, lidt, etc.) we + * are interested only in invlpg and not in any of the rest. + * + * invlpg is a special instruction in that the data it references may not + * be mapped. + */ +void kvm_emulator_want_group7_invlpg(void) +{ + twobyte_table[1] &= ~SrcMem; +} +EXPORT_SYMBOL_GPL(kvm_emulator_want_group7_invlpg); + +/* Type, address-of, and value of an instruction's operand. */ +struct operand { + enum { OP_REG, OP_MEM, OP_IMM } type; + unsigned int bytes; + unsigned long val, orig_val, *ptr; +}; + +/* EFLAGS bit definitions. */ +#define EFLG_OF (1<<11) +#define EFLG_DF (1<<10) +#define EFLG_SF (1<<7) +#define EFLG_ZF (1<<6) +#define EFLG_AF (1<<4) +#define EFLG_PF (1<<2) +#define EFLG_CF (1<<0) + +/* + * Instruction emulation: + * Most instructions are emulated directly via a fragment of inline assembly + * code. This allows us to save/restore EFLAGS and thus very easily pick up + * any modified flags. + */ + +#if defined(__x86_64__) +#define _LO32 "k" /* force 32-bit operand */ +#define _STK "%%rsp" /* stack pointer */ +#elif defined(__i386__) +#define _LO32 "" /* force 32-bit operand */ +#define _STK "%%esp" /* stack pointer */ +#endif + +/* + * These EFLAGS bits are restored from saved value during emulation, and + * any changes are written back to the saved value after emulation. + */ +#define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF) + +/* Before executing instruction: restore necessary bits in EFLAGS. */ +#define _PRE_EFLAGS(_sav, _msk, _tmp) \ + /* EFLAGS = (_sav & _msk) | (EFLAGS & ~_msk); */ \ + "push %"_sav"; " \ + "movl %"_msk",%"_LO32 _tmp"; " \ + "andl %"_LO32 _tmp",("_STK"); " \ + "pushf; " \ + "notl %"_LO32 _tmp"; " \ + "andl %"_LO32 _tmp",("_STK"); " \ + "pop %"_tmp"; " \ + "orl %"_LO32 _tmp",("_STK"); " \ + "popf; " \ + /* _sav &= ~msk; */ \ + "movl %"_msk",%"_LO32 _tmp"; " \ + "notl %"_LO32 _tmp"; " \ + "andl %"_LO32 _tmp",%"_sav"; " + +/* After executing instruction: write-back necessary bits in EFLAGS. */ +#define _POST_EFLAGS(_sav, _msk, _tmp) \ + /* _sav |= EFLAGS & _msk; */ \ + "pushf; " \ + "pop %"_tmp"; " \ + "andl %"_msk",%"_LO32 _tmp"; " \ + "orl %"_LO32 _tmp",%"_sav"; " + +/* Raw emulation: instruction has two explicit operands. */ +#define __emulate_2op_nobyte(_op,_src,_dst,_eflags,_wx,_wy,_lx,_ly,_qx,_qy) \ + do { \ + unsigned long _tmp; \ + \ + switch ((_dst).bytes) { \ + case 2: \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","4","2") \ + _op"w %"_wx"3,%1; " \ + _POST_EFLAGS("0","4","2") \ + : "=m" (_eflags), "=m" ((_dst).val), \ + "=&r" (_tmp) \ + : _wy ((_src).val), "i" (EFLAGS_MASK) ); \ + break; \ + case 4: \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","4","2") \ + _op"l %"_lx"3,%1; " \ + _POST_EFLAGS("0","4","2") \ + : "=m" (_eflags), "=m" ((_dst).val), \ + "=&r" (_tmp) \ + : _ly ((_src).val), "i" (EFLAGS_MASK) ); \ + break; \ + case 8: \ + __emulate_2op_8byte(_op, _src, _dst, \ + _eflags, _qx, _qy); \ + break; \ + } \ + } while (0) + +#define __emulate_2op(_op,_src,_dst,_eflags,_bx,_by,_wx,_wy,_lx,_ly,_qx,_qy) \ + do { \ + unsigned long _tmp; \ + switch ( (_dst).bytes ) \ + { \ + case 1: \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","4","2") \ + _op"b %"_bx"3,%1; " \ + _POST_EFLAGS("0","4","2") \ + : "=m" (_eflags), "=m" ((_dst).val), \ + "=&r" (_tmp) \ + : _by ((_src).val), "i" (EFLAGS_MASK) ); \ + break; \ + default: \ + __emulate_2op_nobyte(_op, _src, _dst, _eflags, \ + _wx, _wy, _lx, _ly, _qx, _qy); \ + break; \ + } \ + } while (0) + +/* Source operand is byte-sized and may be restricted to just %cl. */ +#define emulate_2op_SrcB(_op, _src, _dst, _eflags) \ + __emulate_2op(_op, _src, _dst, _eflags, \ + "b", "c", "b", "c", "b", "c", "b", "c") + +/* Source operand is byte, word, long or quad sized. */ +#define emulate_2op_SrcV(_op, _src, _dst, _eflags) \ + __emulate_2op(_op, _src, _dst, _eflags, \ + "b", "q", "w", "r", _LO32, "r", "", "r") + +/* Source operand is word, long or quad sized. */ +#define emulate_2op_SrcV_nobyte(_op, _src, _dst, _eflags) \ + __emulate_2op_nobyte(_op, _src, _dst, _eflags, \ + "w", "r", _LO32, "r", "", "r") + +/* Instruction has only one explicit operand (no source operand). */ +#define emulate_1op(_op, _dst, _eflags) \ + do { \ + unsigned long _tmp; \ + \ + switch ( (_dst).bytes ) \ + { \ + case 1: \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","3","2") \ + _op"b %1; " \ + _POST_EFLAGS("0","3","2") \ + : "=m" (_eflags), "=m" ((_dst).val), \ + "=&r" (_tmp) \ + : "i" (EFLAGS_MASK) ); \ + break; \ + case 2: \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","3","2") \ + _op"w %1; " \ + _POST_EFLAGS("0","3","2") \ + : "=m" (_eflags), "=m" ((_dst).val), \ + "=&r" (_tmp) \ + : "i" (EFLAGS_MASK) ); \ + break; \ + case 4: \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","3","2") \ + _op"l %1; " \ + _POST_EFLAGS("0","3","2") \ + : "=m" (_eflags), "=m" ((_dst).val), \ + "=&r" (_tmp) \ + : "i" (EFLAGS_MASK) ); \ + break; \ + case 8: \ + __emulate_1op_8byte(_op, _dst, _eflags); \ + break; \ + } \ + } while (0) + +/* Emulate an instruction with quadword operands (x86/64 only). */ +#if defined(__x86_64__) +#define __emulate_2op_8byte(_op, _src, _dst, _eflags, _qx, _qy) \ + do { \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","4","2") \ + _op"q %"_qx"3,%1; " \ + _POST_EFLAGS("0","4","2") \ + : "=m" (_eflags), "=m" ((_dst).val), "=&r" (_tmp) \ + : _qy ((_src).val), "i" (EFLAGS_MASK) ); \ + } while (0) + +#define __emulate_1op_8byte(_op, _dst, _eflags) \ + do { \ + __asm__ __volatile__ ( \ + _PRE_EFLAGS("0","3","2") \ + _op"q %1; " \ + _POST_EFLAGS("0","3","2") \ + : "=m" (_eflags), "=m" ((_dst).val), "=&r" (_tmp) \ + : "i" (EFLAGS_MASK) ); \ + } while (0) + +#elif defined(__i386__) +#define __emulate_2op_8byte(_op, _src, _dst, _eflags, _qx, _qy) +#define __emulate_1op_8byte(_op, _dst, _eflags) +#endif /* __i386__ */ + +/* Fetch next part of the instruction being emulated. */ +#define insn_fetch(_type, _size, _eip) \ +({ unsigned long _x; \ + rc = ops->read_std((unsigned long)(_eip) + ctxt->cs_base, &_x, \ + (_size), ctxt); \ + if ( rc != 0 ) \ + goto done; \ + (_eip) += (_size); \ + (_type)_x; \ +}) + +/* Access/update address held in a register, based on addressing mode. */ +#define register_address(base, reg) \ + ((base) + ((ad_bytes == sizeof(unsigned long)) ? (reg) : \ + ((reg) & ((1UL << (ad_bytes << 3)) - 1)))) + +#define register_address_increment(reg, inc) \ + do { \ + /* signed type ensures sign extension to long */ \ + int _inc = (inc); \ + if ( ad_bytes == sizeof(unsigned long) ) \ + (reg) += _inc; \ + else \ + (reg) = ((reg) & ~((1UL << (ad_bytes << 3)) - 1)) | \ + (((reg) + _inc) & ((1UL << (ad_bytes << 3)) - 1)); \ + } while (0) + +void *decode_register(u8 modrm_reg, unsigned long *regs, + int highbyte_regs) +{ + void *p; + + p = ®s[modrm_reg]; + if (highbyte_regs && modrm_reg >= 4 && modrm_reg < 8) + p = (unsigned char *)®s[modrm_reg & 3] + 1; + return p; +} + +static int read_descriptor(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops, + void *ptr, + u16 *size, unsigned long *address, int op_bytes) +{ + int rc; + + if (op_bytes == 2) + op_bytes = 3; + *address = 0; + rc = ops->read_std((unsigned long)ptr, (unsigned long *)size, 2, ctxt); + if (rc) + return rc; + rc = ops->read_std((unsigned long)ptr + 2, address, op_bytes, ctxt); + return rc; +} + +int +x86_emulate_memop(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops) +{ + u8 b, d, sib, twobyte = 0, rex_prefix = 0; + u8 modrm, modrm_mod = 0, modrm_reg = 0, modrm_rm = 0; + unsigned long *override_base = NULL; + unsigned int op_bytes, ad_bytes, lock_prefix = 0, rep_prefix = 0, i; + int rc = 0; + struct operand src, dst; + unsigned long cr2 = ctxt->cr2; + int mode = ctxt->mode; + unsigned long modrm_ea; + int use_modrm_ea, index_reg = 0, base_reg = 0, scale, rip_relative = 0; + + /* Shadow copy of register state. Committed on successful emulation. */ + unsigned long _regs[NR_VCPU_REGS]; + unsigned long _eip = ctxt->vcpu->rip, _eflags = ctxt->eflags; + unsigned long modrm_val = 0; + + memcpy(_regs, ctxt->vcpu->regs, sizeof _regs); + + switch (mode) { + case X86EMUL_MODE_REAL: + case X86EMUL_MODE_PROT16: + op_bytes = ad_bytes = 2; + break; + case X86EMUL_MODE_PROT32: + op_bytes = ad_bytes = 4; + break; +#ifdef __x86_64__ + case X86EMUL_MODE_PROT64: + op_bytes = 4; + ad_bytes = 8; + break; +#endif + default: + return -1; + } + + /* Legacy prefixes. */ + for (i = 0; i < 8; i++) { + switch (b = insn_fetch(u8, 1, _eip)) { + case 0x66: /* operand-size override */ + op_bytes ^= 6; /* switch between 2/4 bytes */ + break; + case 0x67: /* address-size override */ + if (mode == X86EMUL_MODE_PROT64) + ad_bytes ^= 12; /* switch between 4/8 bytes */ + else + ad_bytes ^= 6; /* switch between 2/4 bytes */ + break; + case 0x2e: /* CS override */ + override_base = &ctxt->cs_base; + break; + case 0x3e: /* DS override */ + override_base = &ctxt->ds_base; + break; + case 0x26: /* ES override */ + override_base = &ctxt->es_base; + break; + case 0x64: /* FS override */ + override_base = &ctxt->fs_base; + break; + case 0x65: /* GS override */ + override_base = &ctxt->gs_base; + break; + case 0x36: /* SS override */ + override_base = &ctxt->ss_base; + break; + case 0xf0: /* LOCK */ + lock_prefix = 1; + break; + case 0xf3: /* REP/REPE/REPZ */ + rep_prefix = 1; + break; + case 0xf2: /* REPNE/REPNZ */ + break; + default: + goto done_prefixes; + } + } + +done_prefixes: + + /* REX prefix. */ + if ((mode == X86EMUL_MODE_PROT64) && ((b & 0xf0) == 0x40)) { + rex_prefix = b; + if (b & 8) + op_bytes = 8; /* REX.W */ + modrm_reg = (b & 4) << 1; /* REX.R */ + index_reg = (b & 2) << 2; /* REX.X */ + modrm_rm = base_reg = (b & 1) << 3; /* REG.B */ + b = insn_fetch(u8, 1, _eip); + } + + /* Opcode byte(s). */ + d = opcode_table[b]; + if (d == 0) { + /* Two-byte opcode? */ + if (b == 0x0f) { + twobyte = 1; + b = insn_fetch(u8, 1, _eip); + d = twobyte_table[b]; + } + + /* Unrecognised? */ + if (d == 0) + goto cannot_emulate; + } + + /* ModRM and SIB bytes. */ + if (d & ModRM) { + modrm = insn_fetch(u8, 1, _eip); + modrm_mod |= (modrm & 0xc0) >> 6; + modrm_reg |= (modrm & 0x38) >> 3; + modrm_rm |= (modrm & 0x07); + modrm_ea = 0; + use_modrm_ea = 1; + + if (modrm_mod == 3) { + modrm_val = *(unsigned long *) + decode_register(modrm_rm, _regs, d & ByteOp); + goto modrm_done; + } + + if (ad_bytes == 2) { + unsigned bx = _regs[VCPU_REGS_RBX]; + unsigned bp = _regs[VCPU_REGS_RBP]; + unsigned si = _regs[VCPU_REGS_RSI]; + unsigned di = _regs[VCPU_REGS_RDI]; + + /* 16-bit ModR/M decode. */ + switch (modrm_mod) { + case 0: + if (modrm_rm == 6) + modrm_ea += insn_fetch(u16, 2, _eip); + break; + case 1: + modrm_ea += insn_fetch(s8, 1, _eip); + break; + case 2: + modrm_ea += insn_fetch(u16, 2, _eip); + break; + } + switch (modrm_rm) { + case 0: + modrm_ea += bx + si; + break; + case 1: + modrm_ea += bx + di; + break; + case 2: + modrm_ea += bp + si; + break; + case 3: + modrm_ea += bp + di; + break; + case 4: + modrm_ea += si; + break; + case 5: + modrm_ea += di; + break; + case 6: + if (modrm_mod != 0) + modrm_ea += bp; + break; + case 7: + modrm_ea += bx; + break; + } + if (modrm_rm == 2 || modrm_rm == 3 || + (modrm_rm == 6 && modrm_mod != 0)) + if (!override_base) + override_base = &ctxt->ss_base; + modrm_ea = (u16)modrm_ea; + } else { + /* 32/64-bit ModR/M decode. */ + switch (modrm_rm) { + case 4: + case 12: + sib = insn_fetch(u8, 1, _eip); + index_reg |= (sib >> 3) & 7; + base_reg |= sib & 7; + scale = sib >> 6; + + switch (base_reg) { + case 5: + if (modrm_mod != 0) + modrm_ea += _regs[base_reg]; + else + modrm_ea += insn_fetch(s32, 4, _eip); + break; + default: + modrm_ea += _regs[base_reg]; + } + switch (index_reg) { + case 4: + break; + default: + modrm_ea += _regs[index_reg] << scale; + + } + break; + case 5: + if (modrm_mod != 0) + modrm_ea += _regs[modrm_rm]; + else if (mode == X86EMUL_MODE_PROT64) + rip_relative = 1; + break; + default: + modrm_ea += _regs[modrm_rm]; + break; + } + switch (modrm_mod) { + case 0: + if (modrm_rm == 5) + modrm_ea += insn_fetch(s32, 4, _eip); + break; + case 1: + modrm_ea += insn_fetch(s8, 1, _eip); + break; + case 2: + modrm_ea += insn_fetch(s32, 4, _eip); + break; + } + } + if (!override_base) + override_base = &ctxt->ds_base; + if (mode == X86EMUL_MODE_PROT64 && + override_base != &ctxt->fs_base && + override_base != &ctxt->gs_base) + override_base = 0; + + if (override_base) + modrm_ea += *override_base; + + if (rip_relative) { + modrm_ea += _eip; + switch (d & SrcMask) { + case SrcImmByte: + modrm_ea += 1; + break; + case SrcImm: + if (d & ByteOp) + modrm_ea += 1; + else + if (op_bytes == 8) + modrm_ea += 4; + else + modrm_ea += op_bytes; + } + } + if (ad_bytes != 8) + modrm_ea = (u32)modrm_ea; + cr2 = modrm_ea; + modrm_done: + ; + } + + /* Decode and fetch the destination operand: register or memory. */ + switch (d & DstMask) { + case ImplicitOps: + /* Special instructions do their own operand decoding. */ + goto special_insn; + case DstReg: + dst.type = OP_REG; + if ((d & ByteOp) + && !(twobyte_table && (b == 0xb6 || b == 0xb7))) { + dst.ptr = decode_register(modrm_reg, _regs, + (rex_prefix == 0)); + dst.val = *(u8 *) dst.ptr; + dst.bytes = 1; + } else { + dst.ptr = decode_register(modrm_reg, _regs, 0); + switch ((dst.bytes = op_bytes)) { + case 2: + dst.val = *(u16 *)dst.ptr; + break; + case 4: + dst.val = *(u32 *)dst.ptr; + break; + case 8: + dst.val = *(u64 *)dst.ptr; + break; + } + } + break; + case DstMem: + dst.type = OP_MEM; + dst.ptr = (unsigned long *)cr2; + dst.bytes = (d & ByteOp) ? 1 : op_bytes; + if (!(d & Mov) && /* optimisation - avoid slow emulated read */ + ((rc = ops->read_emulated((unsigned long)dst.ptr, + &dst.val, dst.bytes, ctxt)) != 0)) + goto done; + break; + } + dst.orig_val = dst.val; + + /* + * Decode and fetch the source operand: register, memory + * or immediate. + */ + switch (d & SrcMask) { + case SrcNone: + break; + case SrcReg: + src.type = OP_REG; + if (d & ByteOp) { + src.ptr = decode_register(modrm_reg, _regs, + (rex_prefix == 0)); + src.val = src.orig_val = *(u8 *) src.ptr; + src.bytes = 1; + } else { + src.ptr = decode_register(modrm_reg, _regs, 0); + switch ((src.bytes = op_bytes)) { + case 2: + src.val = src.orig_val = *(u16 *) src.ptr; + break; + case 4: + src.val = src.orig_val = *(u32 *) src.ptr; + break; + case 8: + src.val = src.orig_val = *(u64 *) src.ptr; + break; + } + } + break; + case SrcMem16: + src.bytes = 2; + goto srcmem_common; + case SrcMem32: + src.bytes = 4; + goto srcmem_common; + case SrcMem: + src.bytes = (d & ByteOp) ? 1 : op_bytes; + srcmem_common: + src.type = OP_MEM; + src.ptr = (unsigned long *)cr2; + if ((rc = ops->read_emulated((unsigned long)src.ptr, + &src.val, src.bytes, ctxt)) != 0) + goto done; + src.orig_val = src.val; + break; + case SrcImm: + src.type = OP_IMM; + src.ptr = (unsigned long *)_eip; + src.bytes = (d & ByteOp) ? 1 : op_bytes; + if (src.bytes == 8) + src.bytes = 4; + /* NB. Immediates are sign-extended as necessary. */ + switch (src.bytes) { + case 1: + src.val = insn_fetch(s8, 1, _eip); + break; + case 2: + src.val = insn_fetch(s16, 2, _eip); + break; + case 4: + src.val = insn_fetch(s32, 4, _eip); + break; + } + break; + case SrcImmByte: + src.type = OP_IMM; + src.ptr = (unsigned long *)_eip; + src.bytes = 1; + src.val = insn_fetch(s8, 1, _eip); + break; + } + + if (twobyte) + goto twobyte_insn; + + switch (b) { + case 0x00 ... 0x05: + add: /* add */ + emulate_2op_SrcV("add", src, dst, _eflags); + break; + case 0x08 ... 0x0d: + or: /* or */ + emulate_2op_SrcV("or", src, dst, _eflags); + break; + case 0x10 ... 0x15: + adc: /* adc */ + emulate_2op_SrcV("adc", src, dst, _eflags); + break; + case 0x18 ... 0x1d: + sbb: /* sbb */ + emulate_2op_SrcV("sbb", src, dst, _eflags); + break; + case 0x20 ... 0x25: + and: /* and */ + emulate_2op_SrcV("and", src, dst, _eflags); + break; + case 0x28 ... 0x2d: + sub: /* sub */ + emulate_2op_SrcV("sub", src, dst, _eflags); + break; + case 0x30 ... 0x35: + xor: /* xor */ + emulate_2op_SrcV("xor", src, dst, _eflags); + break; + case 0x38 ... 0x3d: + cmp: /* cmp */ + emulate_2op_SrcV("cmp", src, dst, _eflags); + break; + case 0x63: /* movsxd */ + if (mode != X86EMUL_MODE_PROT64) + goto cannot_emulate; + dst.val = (s32) src.val; + break; + case 0x80 ... 0x83: /* Grp1 */ + switch (modrm_reg) { + case 0: + goto add; + case 1: + goto or; + case 2: + goto adc; + case 3: + goto sbb; + case 4: + goto and; + case 5: + goto sub; + case 6: + goto xor; + case 7: + goto cmp; + } + break; + case 0x84 ... 0x85: + test: /* test */ + emulate_2op_SrcV("test", src, dst, _eflags); + break; + case 0x86 ... 0x87: /* xchg */ + /* Write back the register source. */ + switch (dst.bytes) { + case 1: + *(u8 *) src.ptr = (u8) dst.val; + break; + case 2: + *(u16 *) src.ptr = (u16) dst.val; + break; + case 4: + *src.ptr = (u32) dst.val; + break; /* 64b reg: zero-extend */ + case 8: + *src.ptr = dst.val; + break; + } + /* + * Write back the memory destination with implicit LOCK + * prefix. + */ + dst.val = src.val; + lock_prefix = 1; + break; + case 0xa0 ... 0xa1: /* mov */ + dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX]; + dst.val = src.val; + _eip += ad_bytes; /* skip src displacement */ + break; + case 0xa2 ... 0xa3: /* mov */ + dst.val = (unsigned long)_regs[VCPU_REGS_RAX]; + _eip += ad_bytes; /* skip dst displacement */ + break; + case 0x88 ... 0x8b: /* mov */ + case 0xc6 ... 0xc7: /* mov (sole member of Grp11) */ + dst.val = src.val; + break; + case 0x8f: /* pop (sole member of Grp1a) */ + /* 64-bit mode: POP always pops a 64-bit operand. */ + if (mode == X86EMUL_MODE_PROT64) + dst.bytes = 8; + if ((rc = ops->read_std(register_address(ctxt->ss_base, + _regs[VCPU_REGS_RSP]), + &dst.val, dst.bytes, ctxt)) != 0) + goto done; + register_address_increment(_regs[VCPU_REGS_RSP], dst.bytes); + break; + case 0xc0 ... 0xc1: + grp2: /* Grp2 */ + switch (modrm_reg) { + case 0: /* rol */ + emulate_2op_SrcB("rol", src, dst, _eflags); + break; + case 1: /* ror */ + emulate_2op_SrcB("ror", src, dst, _eflags); + break; + case 2: /* rcl */ + emulate_2op_SrcB("rcl", src, dst, _eflags); + break; + case 3: /* rcr */ + emulate_2op_SrcB("rcr", src, dst, _eflags); + break; + case 4: /* sal/shl */ + case 6: /* sal/shl */ + emulate_2op_SrcB("sal", src, dst, _eflags); + break; + case 5: /* shr */ + emulate_2op_SrcB("shr", src, dst, _eflags); + break; + case 7: /* sar */ + emulate_2op_SrcB("sar", src, dst, _eflags); + break; + } + break; + case 0xd0 ... 0xd1: /* Grp2 */ + src.val = 1; + goto grp2; + case 0xd2 ... 0xd3: /* Grp2 */ + src.val = _regs[VCPU_REGS_RCX]; + goto grp2; + case 0xf6 ... 0xf7: /* Grp3 */ + switch (modrm_reg) { + case 0 ... 1: /* test */ + /* + * Special case in Grp3: test has an immediate + * source operand. + */ + src.type = OP_IMM; + src.ptr = (unsigned long *)_eip; + src.bytes = (d & ByteOp) ? 1 : op_bytes; + if (src.bytes == 8) + src.bytes = 4; + switch (src.bytes) { + case 1: + src.val = insn_fetch(s8, 1, _eip); + break; + case 2: + src.val = insn_fetch(s16, 2, _eip); + break; + case 4: + src.val = insn_fetch(s32, 4, _eip); + break; + } + goto test; + case 2: /* not */ + dst.val = ~dst.val; + break; + case 3: /* neg */ + emulate_1op("neg", dst, _eflags); + break; + default: + goto cannot_emulate; + } + break; + case 0xfe ... 0xff: /* Grp4/Grp5 */ + switch (modrm_reg) { + case 0: /* inc */ + emulate_1op("inc", dst, _eflags); + break; + case 1: /* dec */ + emulate_1op("dec", dst, _eflags); + break; + case 6: /* push */ + /* 64-bit mode: PUSH always pushes a 64-bit operand. */ + if (mode == X86EMUL_MODE_PROT64) { + dst.bytes = 8; + if ((rc = ops->read_std((unsigned long)dst.ptr, + &dst.val, 8, + ctxt)) != 0) + goto done; + } + register_address_increment(_regs[VCPU_REGS_RSP], + -dst.bytes); + if ((rc = ops->write_std( + register_address(ctxt->ss_base, + _regs[VCPU_REGS_RSP]), + dst.val, dst.bytes, ctxt)) != 0) + goto done; + dst.val = dst.orig_val; /* skanky: disable writeback */ + break; + default: + goto cannot_emulate; + } + break; + } + +writeback: + if ((d & Mov) || (dst.orig_val != dst.val)) { + switch (dst.type) { + case OP_REG: + /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */ + switch (dst.bytes) { + case 1: + *(u8 *)dst.ptr = (u8)dst.val; + break; + case 2: + *(u16 *)dst.ptr = (u16)dst.val; + break; + case 4: + *dst.ptr = (u32)dst.val; + break; /* 64b: zero-ext */ + case 8: + *dst.ptr = dst.val; + break; + } + break; + case OP_MEM: + if (lock_prefix) + rc = ops->cmpxchg_emulated((unsigned long)dst. + ptr, dst.orig_val, + dst.val, dst.bytes, + ctxt); + else + rc = ops->write_emulated((unsigned long)dst.ptr, + dst.val, dst.bytes, + ctxt); + if (rc != 0) + goto done; + default: + break; + } + } + + /* Commit shadow register state. */ + memcpy(ctxt->vcpu->regs, _regs, sizeof _regs); + ctxt->eflags = _eflags; + ctxt->vcpu->rip = _eip; + +done: + return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0; + +special_insn: + if (twobyte) + goto twobyte_special_insn; + if (rep_prefix) { + if (_regs[VCPU_REGS_RCX] == 0) { + ctxt->vcpu->rip = _eip; + goto done; + } + _regs[VCPU_REGS_RCX]--; + _eip = ctxt->vcpu->rip; + } + switch (b) { + case 0xa4 ... 0xa5: /* movs */ + dst.type = OP_MEM; + dst.bytes = (d & ByteOp) ? 1 : op_bytes; + dst.ptr = (unsigned long *)register_address(ctxt->es_base, + _regs[VCPU_REGS_RDI]); + if ((rc = ops->read_emulated(register_address( + override_base ? *override_base : ctxt->ds_base, + _regs[VCPU_REGS_RSI]), &dst.val, dst.bytes, ctxt)) != 0) + goto done; + register_address_increment(_regs[VCPU_REGS_RSI], + (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); + register_address_increment(_regs[VCPU_REGS_RDI], + (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); + break; + case 0xa6 ... 0xa7: /* cmps */ + DPRINTF("Urk! I don't handle CMPS.\n"); + goto cannot_emulate; + case 0xaa ... 0xab: /* stos */ + dst.type = OP_MEM; + dst.bytes = (d & ByteOp) ? 1 : op_bytes; + dst.ptr = (unsigned long *)cr2; + dst.val = _regs[VCPU_REGS_RAX]; + register_address_increment(_regs[VCPU_REGS_RDI], + (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); + break; + case 0xac ... 0xad: /* lods */ + dst.type = OP_REG; + dst.bytes = (d & ByteOp) ? 1 : op_bytes; + dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX]; + if ((rc = ops->read_emulated(cr2, &dst.val, dst.bytes, ctxt)) != 0) + goto done; + register_address_increment(_regs[VCPU_REGS_RSI], + (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes); + break; + case 0xae ... 0xaf: /* scas */ + DPRINTF("Urk! I don't handle SCAS.\n"); + goto cannot_emulate; + } + goto writeback; + +twobyte_insn: + switch (b) { + case 0x01: /* lgdt, lidt, lmsw */ + switch (modrm_reg) { + u16 size; + unsigned long address; + + case 2: /* lgdt */ + rc = read_descriptor(ctxt, ops, src.ptr, + &size, &address, op_bytes); + if (rc) + goto done; + realmode_lgdt(ctxt->vcpu, size, address); + break; + case 3: /* lidt */ + rc = read_descriptor(ctxt, ops, src.ptr, + &size, &address, op_bytes); + if (rc) + goto done; + realmode_lidt(ctxt->vcpu, size, address); + break; + case 6: /* lmsw */ + realmode_lmsw(ctxt->vcpu, (u16)modrm_val, &_eflags); + break; + case 7: /* invlpg*/ + emulate_invlpg(ctxt->vcpu, cr2); + break; + default: + goto cannot_emulate; + } + break; + case 0x21: /* mov from dr to reg */ + if (modrm_mod != 3) + goto cannot_emulate; + rc = emulator_get_dr(ctxt, modrm_reg, &_regs[modrm_rm]); + break; + case 0x23: /* mov from reg to dr */ + if (modrm_mod != 3) + goto cannot_emulate; + rc = emulator_set_dr(ctxt, modrm_reg, _regs[modrm_rm]); + break; + case 0x40 ... 0x4f: /* cmov */ + dst.val = dst.orig_val = src.val; + d &= ~Mov; /* default to no move */ + /* + * First, assume we're decoding an even cmov opcode + * (lsb == 0). + */ + switch ((b & 15) >> 1) { + case 0: /* cmovo */ + d |= (_eflags & EFLG_OF) ? Mov : 0; + break; + case 1: /* cmovb/cmovc/cmovnae */ + d |= (_eflags & EFLG_CF) ? Mov : 0; + break; + case 2: /* cmovz/cmove */ + d |= (_eflags & EFLG_ZF) ? Mov : 0; + break; + case 3: /* cmovbe/cmovna */ + d |= (_eflags & (EFLG_CF | EFLG_ZF)) ? Mov : 0; + break; + case 4: /* cmovs */ + d |= (_eflags & EFLG_SF) ? Mov : 0; + break; + case 5: /* cmovp/cmovpe */ + d |= (_eflags & EFLG_PF) ? Mov : 0; + break; + case 7: /* cmovle/cmovng */ + d |= (_eflags & EFLG_ZF) ? Mov : 0; + /* fall through */ + case 6: /* cmovl/cmovnge */ + d |= (!(_eflags & EFLG_SF) != + !(_eflags & EFLG_OF)) ? Mov : 0; + break; + } + /* Odd cmov opcodes (lsb == 1) have inverted sense. */ + d ^= (b & 1) ? Mov : 0; + break; + case 0xb0 ... 0xb1: /* cmpxchg */ + /* + * Save real source value, then compare EAX against + * destination. + */ + src.orig_val = src.val; + src.val = _regs[VCPU_REGS_RAX]; + emulate_2op_SrcV("cmp", src, dst, _eflags); + /* Always write back. The question is: where to? */ + d |= Mov; + if (_eflags & EFLG_ZF) { + /* Success: write back to memory. */ + dst.val = src.orig_val; + } else { + /* Failure: write the value we saw to EAX. */ + dst.type = OP_REG; + dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX]; + } + break; + case 0xa3: + bt: /* bt */ + src.val &= (dst.bytes << 3) - 1; /* only subword offset */ + emulate_2op_SrcV_nobyte("bt", src, dst, _eflags); + break; + case 0xb3: + btr: /* btr */ + src.val &= (dst.bytes << 3) - 1; /* only subword offset */ + emulate_2op_SrcV_nobyte("btr", src, dst, _eflags); + break; + case 0xab: + bts: /* bts */ + src.val &= (dst.bytes << 3) - 1; /* only subword offset */ + emulate_2op_SrcV_nobyte("bts", src, dst, _eflags); + break; + case 0xb6 ... 0xb7: /* movzx */ + dst.bytes = op_bytes; + dst.val = (d & ByteOp) ? (u8) src.val : (u16) src.val; + break; + case 0xbb: + btc: /* btc */ + src.val &= (dst.bytes << 3) - 1; /* only subword offset */ + emulate_2op_SrcV_nobyte("btc", src, dst, _eflags); + break; + case 0xba: /* Grp8 */ + switch (modrm_reg & 3) { + case 0: + goto bt; + case 1: + goto bts; + case 2: + goto btr; + case 3: + goto btc; + } + break; + case 0xbe ... 0xbf: /* movsx */ + dst.bytes = op_bytes; + dst.val = (d & ByteOp) ? (s8) src.val : (s16) src.val; + break; + } + goto writeback; + +twobyte_special_insn: + /* Disable writeback. */ + dst.orig_val = dst.val; + switch (b) { + case 0x0d: /* GrpP (prefetch) */ + case 0x18: /* Grp16 (prefetch/nop) */ + break; + case 0x06: + emulate_clts(ctxt->vcpu); + break; + case 0x20: /* mov cr, reg */ + b = insn_fetch(u8, 1, _eip); + if ((b & 0xc0) != 0xc0) + goto cannot_emulate; + _regs[b & 7] = realmode_get_cr(ctxt->vcpu, (b >> 3) & 7); + break; + case 0x22: /* mov reg, cr */ + b = insn_fetch(u8, 1, _eip); + if ((b & 0xc0) != 0xc0) + goto cannot_emulate; + realmode_set_cr(ctxt->vcpu, (b >> 3) & 7, _regs[b & 7] & -1u, + &_eflags); + break; + case 0xc7: /* Grp9 (cmpxchg8b) */ +#if defined(__i386__) + { + unsigned long old_lo, old_hi; + if (((rc = ops->read_emulated(cr2 + 0, &old_lo, 4, + ctxt)) != 0) + || ((rc = ops->read_emulated(cr2 + 4, &old_hi, 4, + ctxt)) != 0)) + goto done; + if ((old_lo != _regs[VCPU_REGS_RAX]) + || (old_hi != _regs[VCPU_REGS_RDI])) { + _regs[VCPU_REGS_RAX] = old_lo; + _regs[VCPU_REGS_RDX] = old_hi; + _eflags &= ~EFLG_ZF; + } else if (ops->cmpxchg8b_emulated == NULL) { + rc = X86EMUL_UNHANDLEABLE; + goto done; + } else { + if ((rc = ops->cmpxchg8b_emulated(cr2, old_lo, + old_hi, + _regs[VCPU_REGS_RBX], + _regs[VCPU_REGS_RCX], + ctxt)) != 0) + goto done; + _eflags |= EFLG_ZF; + } + break; + } +#elif defined(__x86_64__) + { + unsigned long old, new; + if ((rc = ops->read_emulated(cr2, &old, 8, ctxt)) != 0) + goto done; + if (((u32) (old >> 0) != (u32) _regs[VCPU_REGS_RAX]) || + ((u32) (old >> 32) != (u32) _regs[VCPU_REGS_RDX])) { + _regs[VCPU_REGS_RAX] = (u32) (old >> 0); + _regs[VCPU_REGS_RDX] = (u32) (old >> 32); + _eflags &= ~EFLG_ZF; + } else { + new = (_regs[VCPU_REGS_RCX] << 32) | (u32) _regs[VCPU_REGS_RBX]; + if ((rc = ops->cmpxchg_emulated(cr2, old, + new, 8, ctxt)) != 0) + goto done; + _eflags |= EFLG_ZF; + } + break; + } +#endif + } + goto writeback; + +cannot_emulate: + DPRINTF("Cannot emulate %02x\n", b); + return -1; +} + +#ifdef __XEN__ + +#include +#include + +int +x86_emulate_read_std(unsigned long addr, + unsigned long *val, + unsigned int bytes, struct x86_emulate_ctxt *ctxt) +{ + unsigned int rc; + + *val = 0; + + if ((rc = copy_from_user((void *)val, (void *)addr, bytes)) != 0) { + propagate_page_fault(addr + bytes - rc, 0); /* read fault */ + return X86EMUL_PROPAGATE_FAULT; + } + + return X86EMUL_CONTINUE; +} + +int +x86_emulate_write_std(unsigned long addr, + unsigned long val, + unsigned int bytes, struct x86_emulate_ctxt *ctxt) +{ + unsigned int rc; + + if ((rc = copy_to_user((void *)addr, (void *)&val, bytes)) != 0) { + propagate_page_fault(addr + bytes - rc, PGERR_write_access); + return X86EMUL_PROPAGATE_FAULT; + } + + return X86EMUL_CONTINUE; +} + +#endif Index: linux/drivers/kvm/x86_emulate.h =================================================================== --- /dev/null +++ linux/drivers/kvm/x86_emulate.h @@ -0,0 +1,185 @@ +/****************************************************************************** + * x86_emulate.h + * + * Generic x86 (32-bit and 64-bit) instruction decoder and emulator. + * + * Copyright (c) 2005 Keir Fraser + * + * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4 + */ + +#ifndef __X86_EMULATE_H__ +#define __X86_EMULATE_H__ + +struct x86_emulate_ctxt; + +/* + * x86_emulate_ops: + * + * These operations represent the instruction emulator's interface to memory. + * There are two categories of operation: those that act on ordinary memory + * regions (*_std), and those that act on memory regions known to require + * special treatment or emulation (*_emulated). + * + * The emulator assumes that an instruction accesses only one 'emulated memory' + * location, that this location is the given linear faulting address (cr2), and + * that this is one of the instruction's data operands. Instruction fetches and + * stack operations are assumed never to access emulated memory. The emulator + * automatically deduces which operand of a string-move operation is accessing + * emulated memory, and assumes that the other operand accesses normal memory. + * + * NOTES: + * 1. The emulator isn't very smart about emulated vs. standard memory. + * 'Emulated memory' access addresses should be checked for sanity. + * 'Normal memory' accesses may fault, and the caller must arrange to + * detect and handle reentrancy into the emulator via recursive faults. + * Accesses may be unaligned and may cross page boundaries. + * 2. If the access fails (cannot emulate, or a standard access faults) then + * it is up to the memop to propagate the fault to the guest VM via + * some out-of-band mechanism, unknown to the emulator. The memop signals + * failure by returning X86EMUL_PROPAGATE_FAULT to the emulator, which will + * then immediately bail. + * 3. Valid access sizes are 1, 2, 4 and 8 bytes. On x86/32 systems only + * cmpxchg8b_emulated need support 8-byte accesses. + * 4. The emulator cannot handle 64-bit mode emulation on an x86/32 system. + */ +/* Access completed successfully: continue emulation as normal. */ +#define X86EMUL_CONTINUE 0 +/* Access is unhandleable: bail from emulation and return error to caller. */ +#define X86EMUL_UNHANDLEABLE 1 +/* Terminate emulation but return success to the caller. */ +#define X86EMUL_PROPAGATE_FAULT 2 /* propagate a generated fault to guest */ +#define X86EMUL_RETRY_INSTR 2 /* retry the instruction for some reason */ +#define X86EMUL_CMPXCHG_FAILED 2 /* cmpxchg did not see expected value */ +struct x86_emulate_ops { + /* + * read_std: Read bytes of standard (non-emulated/special) memory. + * Used for instruction fetch, stack operations, and others. + * @addr: [IN ] Linear address from which to read. + * @val: [OUT] Value read from memory, zero-extended to 'u_long'. + * @bytes: [IN ] Number of bytes to read from memory. + */ + int (*read_std)(unsigned long addr, + unsigned long *val, + unsigned int bytes, struct x86_emulate_ctxt * ctxt); + + /* + * write_std: Write bytes of standard (non-emulated/special) memory. + * Used for stack operations, and others. + * @addr: [IN ] Linear address to which to write. + * @val: [IN ] Value to write to memory (low-order bytes used as + * required). + * @bytes: [IN ] Number of bytes to write to memory. + */ + int (*write_std)(unsigned long addr, + unsigned long val, + unsigned int bytes, struct x86_emulate_ctxt * ctxt); + + /* + * read_emulated: Read bytes from emulated/special memory area. + * @addr: [IN ] Linear address from which to read. + * @val: [OUT] Value read from memory, zero-extended to 'u_long'. + * @bytes: [IN ] Number of bytes to read from memory. + */ + int (*read_emulated) (unsigned long addr, + unsigned long *val, + unsigned int bytes, + struct x86_emulate_ctxt * ctxt); + + /* + * write_emulated: Read bytes from emulated/special memory area. + * @addr: [IN ] Linear address to which to write. + * @val: [IN ] Value to write to memory (low-order bytes used as + * required). + * @bytes: [IN ] Number of bytes to write to memory. + */ + int (*write_emulated) (unsigned long addr, + unsigned long val, + unsigned int bytes, + struct x86_emulate_ctxt * ctxt); + + /* + * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an + * emulated/special memory area. + * @addr: [IN ] Linear address to access. + * @old: [IN ] Value expected to be current at @addr. + * @new: [IN ] Value to write to @addr. + * @bytes: [IN ] Number of bytes to access using CMPXCHG. + */ + int (*cmpxchg_emulated) (unsigned long addr, + unsigned long old, + unsigned long new, + unsigned int bytes, + struct x86_emulate_ctxt * ctxt); + + /* + * cmpxchg8b_emulated: Emulate an atomic (LOCKed) CMPXCHG8B operation on an + * emulated/special memory area. + * @addr: [IN ] Linear address to access. + * @old: [IN ] Value expected to be current at @addr. + * @new: [IN ] Value to write to @addr. + * NOTES: + * 1. This function is only ever called when emulating a real CMPXCHG8B. + * 2. This function is *never* called on x86/64 systems. + * 2. Not defining this function (i.e., specifying NULL) is equivalent + * to defining a function that always returns X86EMUL_UNHANDLEABLE. + */ + int (*cmpxchg8b_emulated) (unsigned long addr, + unsigned long old_lo, + unsigned long old_hi, + unsigned long new_lo, + unsigned long new_hi, + struct x86_emulate_ctxt * ctxt); +}; + +struct cpu_user_regs; + +struct x86_emulate_ctxt { + /* Register state before/after emulation. */ + struct kvm_vcpu *vcpu; + + /* Linear faulting address (if emulating a page-faulting instruction). */ + unsigned long eflags; + unsigned long cr2; + + /* Emulated execution mode, represented by an X86EMUL_MODE value. */ + int mode; + + unsigned long cs_base; + unsigned long ds_base; + unsigned long es_base; + unsigned long ss_base; + unsigned long gs_base; + unsigned long fs_base; +}; + +/* Execution mode, passed to the emulator. */ +#define X86EMUL_MODE_REAL 0 /* Real mode. */ +#define X86EMUL_MODE_PROT16 2 /* 16-bit protected mode. */ +#define X86EMUL_MODE_PROT32 4 /* 32-bit protected mode. */ +#define X86EMUL_MODE_PROT64 8 /* 64-bit (long) mode. */ + +/* Host execution mode. */ +#if defined(__i386__) +#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32 +#elif defined(__x86_64__) +#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64 +#endif + +/* + * x86_emulate_memop: Emulate an instruction that faulted attempting to + * read/write a 'special' memory area. + * Returns -1 on failure, 0 on success. + */ +int x86_emulate_memop(struct x86_emulate_ctxt *ctxt, + struct x86_emulate_ops *ops); + +/* + * Given the 'reg' portion of a ModRM byte, and a register block, return a + * pointer into the block that addresses the relevant register. + * @highbyte_regs specifies whether to decode AH,CH,DH,BH. + */ +void *decode_register(u8 modrm_reg, unsigned long *regs, + int highbyte_regs); + +#endif /* __X86_EMULATE_H__ */ Index: linux/drivers/macintosh/adb.c =================================================================== --- linux.orig/drivers/macintosh/adb.c +++ linux/drivers/macintosh/adb.c @@ -256,6 +256,8 @@ adb_probe_task(void *x) sigprocmask(SIG_BLOCK, &blocked, NULL); flush_signals(current); + down(&adb_probe_mutex); + printk(KERN_INFO "adb: starting probe task...\n"); do_adb_reset_bus(); printk(KERN_INFO "adb: finished probe task...\n"); @@ -282,7 +284,9 @@ adb_reset_bus(void) return 0; } - down(&adb_probe_mutex); + if (adb_got_sleep) + return 0; + schedule_work(&adb_reset_work); return 0; } @@ -347,23 +351,21 @@ adb_notify_sleep(struct pmu_sleep_notifi switch (when) { case PBOOK_SLEEP_REQUEST: + /* Signal to discontiue probing */ adb_got_sleep = 1; - /* We need to get a lock on the probe thread */ - down(&adb_probe_mutex); /* Stop autopoll */ if (adb_controller->autopoll) adb_controller->autopoll(0); ret = blocking_notifier_call_chain(&adb_client_list, ADB_MSG_POWERDOWN, NULL); if (ret & NOTIFY_STOP_MASK) { - up(&adb_probe_mutex); + adb_got_sleep = 0; return PBOOK_SLEEP_REFUSE; } break; case PBOOK_SLEEP_REJECT: if (adb_got_sleep) { adb_got_sleep = 0; - up(&adb_probe_mutex); adb_reset_bus(); } break; @@ -372,7 +374,6 @@ adb_notify_sleep(struct pmu_sleep_notifi break; case PBOOK_WAKE: adb_got_sleep = 0; - up(&adb_probe_mutex); adb_reset_bus(); break; } Index: linux/drivers/media/dvb/dvb-core/dvb_frontend.c =================================================================== --- linux.orig/drivers/media/dvb/dvb-core/dvb_frontend.c +++ linux/drivers/media/dvb/dvb-core/dvb_frontend.c @@ -97,7 +97,7 @@ struct dvb_frontend_private { struct dvb_device *dvbdev; struct dvb_frontend_parameters parameters; struct dvb_fe_events events; - struct semaphore sem; + struct compat_semaphore sem; struct list_head list_head; wait_queue_head_t wait_queue; pid_t thread_pid; Index: linux/drivers/media/dvb/dvb-core/dvb_frontend.h =================================================================== --- linux.orig/drivers/media/dvb/dvb-core/dvb_frontend.h +++ linux/drivers/media/dvb/dvb-core/dvb_frontend.h @@ -142,7 +142,7 @@ struct dvb_fe_events { int eventr; int overflow; wait_queue_head_t wait_queue; - struct semaphore sem; + struct compat_semaphore sem; }; struct dvb_frontend { Index: linux/drivers/net/3c527.c =================================================================== --- linux.orig/drivers/net/3c527.c +++ linux/drivers/net/3c527.c @@ -182,7 +182,7 @@ struct mc32_local u16 rx_ring_tail; /* index to rx de-queue end */ - struct semaphore cmd_mutex; /* Serialises issuing of execute commands */ + struct compat_semaphore cmd_mutex; /* Serialises issuing of execute commands */ struct completion execution_cmd; /* Card has completed an execute command */ struct completion xceiver_cmd; /* Card has completed a tx or rx command */ }; Index: linux/drivers/net/3c59x.c =================================================================== --- linux.orig/drivers/net/3c59x.c +++ linux/drivers/net/3c59x.c @@ -793,9 +793,9 @@ static void poll_vortex(struct net_devic struct vortex_private *vp = netdev_priv(dev); unsigned long flags; local_save_flags(flags); - local_irq_disable(); + local_irq_disable_nort(); (vp->full_bus_master_rx ? boomerang_interrupt:vortex_interrupt)(dev->irq,dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #endif @@ -1725,6 +1725,7 @@ vortex_timer(unsigned long data) int next_tick = 60*HZ; int ok = 0; int media_status, old_window; + unsigned long flags; if (vortex_debug > 2) { printk(KERN_DEBUG "%s: Media selection timer tick happened, %s.\n", @@ -1732,7 +1733,7 @@ vortex_timer(unsigned long data) printk(KERN_DEBUG "dev->watchdog_timeo=%d\n", dev->watchdog_timeo); } - disable_irq_lockdep(dev->irq); + spin_lock_irqsave(&vp->lock, flags); old_window = ioread16(ioaddr + EL3_CMD) >> 13; EL3WINDOW(4); media_status = ioread16(ioaddr + Wn4_Media); @@ -1755,9 +1756,7 @@ vortex_timer(unsigned long data) case XCVR_MII: case XCVR_NWAY: { ok = 1; - spin_lock_bh(&vp->lock); vortex_check_media(dev, 0); - spin_unlock_bh(&vp->lock); } break; default: /* Other media types handled by Tx timeouts. */ @@ -1813,7 +1812,7 @@ leave_media_alone: dev->name, media_tbl[dev->if_port].name); EL3WINDOW(old_window); - enable_irq_lockdep(dev->irq); + spin_unlock_irqrestore(&vp->lock, flags); mod_timer(&vp->timer, RUN_AT(next_tick)); if (vp->deferred) iowrite16(FakeIntr, ioaddr + EL3_CMD); @@ -1846,13 +1845,17 @@ static void vortex_tx_timeout(struct net /* * Block interrupts because vortex_interrupt does a bare spin_lock() */ +#ifndef CONFIG_PREEMPT_RT unsigned long flags; local_irq_save(flags); +#endif if (vp->full_bus_master_tx) boomerang_interrupt(dev->irq, dev); else vortex_interrupt(dev->irq, dev); +#ifndef CONFIG_PREEMPT_RT local_irq_restore(flags); +#endif } } Index: linux/drivers/net/8139too.c =================================================================== --- linux.orig/drivers/net/8139too.c +++ linux/drivers/net/8139too.c @@ -2216,7 +2216,11 @@ static irqreturn_t rtl8139_interrupt (in */ static void rtl8139_poll_controller(struct net_device *dev) { - disable_irq(dev->irq); + /* + * use _nosync() variant - might be used by netconsole + * from atomic contexts: + */ + disable_irq_nosync(dev->irq); rtl8139_interrupt(dev->irq, dev); enable_irq(dev->irq); } Index: linux/drivers/net/e1000/e1000.h =================================================================== --- linux.orig/drivers/net/e1000/e1000.h +++ linux/drivers/net/e1000/e1000.h @@ -59,6 +59,9 @@ #include #include #include +#ifdef NETIF_F_TSO6 +#include +#endif #include #include #include @@ -254,6 +257,17 @@ struct e1000_adapter { spinlock_t tx_queue_lock; #endif atomic_t irq_sem; + unsigned int detect_link; + unsigned int total_tx_bytes; + unsigned int total_tx_packets; + unsigned int total_rx_bytes; + unsigned int total_rx_packets; + /* Interrupt Throttle Rate */ + uint32_t itr; + uint32_t itr_setting; + uint16_t tx_itr; + uint16_t rx_itr; + struct work_struct reset_task; uint8_t fc_autoneg; @@ -262,6 +276,7 @@ struct e1000_adapter { /* TX */ struct e1000_tx_ring *tx_ring; /* One per active queue */ + unsigned int restart_queue; unsigned long tx_queue_len; uint32_t txd_cmd; uint32_t tx_int_delay; @@ -310,8 +325,6 @@ struct e1000_adapter { uint64_t gorcl_old; uint16_t rx_ps_bsize0; - /* Interrupt Throttle Rate */ - uint32_t itr; /* OS defined structs */ struct net_device *netdev; Index: linux/drivers/net/e1000/e1000_ethtool.c =================================================================== --- linux.orig/drivers/net/e1000/e1000_ethtool.c +++ linux/drivers/net/e1000/e1000_ethtool.c @@ -85,6 +85,7 @@ static const struct e1000_stats e1000_gs { "tx_single_coll_ok", E1000_STAT(stats.scc) }, { "tx_multi_coll_ok", E1000_STAT(stats.mcc) }, { "tx_timeout_count", E1000_STAT(tx_timeout_count) }, + { "tx_restart_queue", E1000_STAT(restart_queue) }, { "rx_long_length_errors", E1000_STAT(stats.roc) }, { "rx_short_length_errors", E1000_STAT(stats.ruc) }, { "rx_align_errors", E1000_STAT(stats.algnerrc) }, @@ -133,9 +134,7 @@ e1000_get_settings(struct net_device *ne if (hw->autoneg == 1) { ecmd->advertising |= ADVERTISED_Autoneg; - /* the e1000 autoneg seems to match ethtool nicely */ - ecmd->advertising |= hw->autoneg_advertised; } @@ -285,7 +284,7 @@ e1000_set_pauseparam(struct net_device * e1000_reset(adapter); } else retval = ((hw->media_type == e1000_media_type_fiber) ? - e1000_setup_link(hw) : e1000_force_mac_fc(hw)); + e1000_setup_link(hw) : e1000_force_mac_fc(hw)); clear_bit(__E1000_RESETTING, &adapter->flags); return retval; @@ -350,6 +349,13 @@ e1000_set_tso(struct net_device *netdev, else netdev->features &= ~NETIF_F_TSO; +#ifdef NETIF_F_TSO6 + if (data) + netdev->features |= NETIF_F_TSO6; + else + netdev->features &= ~NETIF_F_TSO6; +#endif + DPRINTK(PROBE, INFO, "TSO is %s\n", data ? "Enabled" : "Disabled"); adapter->tso_force = TRUE; return 0; @@ -774,7 +780,7 @@ e1000_reg_test(struct e1000_adapter *ada /* The status register is Read Only, so a write should fail. * Some bits that get toggled are ignored. */ - switch (adapter->hw.mac_type) { + switch (adapter->hw.mac_type) { /* there are several bits on newer hardware that are r/w */ case e1000_82571: case e1000_82572: @@ -802,12 +808,14 @@ e1000_reg_test(struct e1000_adapter *ada } /* restore previous status */ E1000_WRITE_REG(&adapter->hw, STATUS, before); + if (adapter->hw.mac_type != e1000_ich8lan) { REG_PATTERN_TEST(FCAL, 0xFFFFFFFF, 0xFFFFFFFF); REG_PATTERN_TEST(FCAH, 0x0000FFFF, 0xFFFFFFFF); REG_PATTERN_TEST(FCT, 0x0000FFFF, 0xFFFFFFFF); REG_PATTERN_TEST(VET, 0x0000FFFF, 0xFFFFFFFF); } + REG_PATTERN_TEST(RDTR, 0x0000FFFF, 0xFFFFFFFF); REG_PATTERN_TEST(RDBAH, 0xFFFFFFFF, 0xFFFFFFFF); REG_PATTERN_TEST(RDLEN, 0x000FFF80, 0x000FFFFF); @@ -820,8 +828,9 @@ e1000_reg_test(struct e1000_adapter *ada REG_PATTERN_TEST(TDLEN, 0x000FFF80, 0x000FFFFF); REG_SET_AND_CHECK(RCTL, 0xFFFFFFFF, 0x00000000); + before = (adapter->hw.mac_type == e1000_ich8lan ? - 0x06C3B33E : 0x06DFB3FE); + 0x06C3B33E : 0x06DFB3FE); REG_SET_AND_CHECK(RCTL, before, 0x003FFFFB); REG_SET_AND_CHECK(TCTL, 0xFFFFFFFF, 0x00000000); @@ -834,10 +843,10 @@ e1000_reg_test(struct e1000_adapter *ada REG_PATTERN_TEST(TDBAL, 0xFFFFFFF0, 0xFFFFFFFF); REG_PATTERN_TEST(TIDV, 0x0000FFFF, 0x0000FFFF); value = (adapter->hw.mac_type == e1000_ich8lan ? - E1000_RAR_ENTRIES_ICH8LAN : E1000_RAR_ENTRIES); + E1000_RAR_ENTRIES_ICH8LAN : E1000_RAR_ENTRIES); for (i = 0; i < value; i++) { REG_PATTERN_TEST(RA + (((i << 1) + 1) << 2), 0x8003FFFF, - 0xFFFFFFFF); + 0xFFFFFFFF); } } else { @@ -883,8 +892,7 @@ e1000_eeprom_test(struct e1000_adapter * } static irqreturn_t -e1000_test_intr(int irq, - void *data) +e1000_test_intr(int irq, void *data) { struct net_device *netdev = (struct net_device *) data; struct e1000_adapter *adapter = netdev_priv(netdev); @@ -905,11 +913,11 @@ e1000_intr_test(struct e1000_adapter *ad /* NOTE: we don't test MSI interrupts here, yet */ /* Hook up test interrupt handler just for this test */ - if (!request_irq(irq, &e1000_test_intr, IRQF_PROBE_SHARED, - netdev->name, netdev)) + if (!request_irq(irq, &e1000_test_intr, IRQF_PROBE_SHARED, netdev->name, + netdev)) shared_int = FALSE; else if (request_irq(irq, &e1000_test_intr, IRQF_SHARED, - netdev->name, netdev)) { + netdev->name, netdev)) { *data = 1; return -1; } @@ -925,6 +933,7 @@ e1000_intr_test(struct e1000_adapter *ad if (adapter->hw.mac_type == e1000_ich8lan && i == 8) continue; + /* Interrupt to test */ mask = 1 << i; @@ -1674,7 +1683,7 @@ e1000_diag_test(struct net_device *netde if (e1000_link_test(adapter, &data[4])) eth_test->flags |= ETH_TEST_FL_FAILED; - /* Offline tests aren't run; pass by default */ + /* Online tests aren't run; pass by default */ data[0] = 0; data[1] = 0; data[2] = 0; @@ -1717,6 +1726,7 @@ static int e1000_wol_exclusion(struct e1 retval = 0; break; case E1000_DEV_ID_82571EB_QUAD_COPPER: + case E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE: case E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3: /* quad port adapters only support WoL on port A */ if (!adapter->quad_port_a) { Index: linux/drivers/net/e1000/e1000_hw.c =================================================================== --- linux.orig/drivers/net/e1000/e1000_hw.c +++ linux/drivers/net/e1000/e1000_hw.c @@ -385,6 +385,7 @@ e1000_set_mac_type(struct e1000_hw *hw) case E1000_DEV_ID_82571EB_FIBER: case E1000_DEV_ID_82571EB_SERDES: case E1000_DEV_ID_82571EB_QUAD_COPPER: + case E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE: hw->mac_type = e1000_82571; break; case E1000_DEV_ID_82572EI_COPPER: @@ -408,6 +409,8 @@ e1000_set_mac_type(struct e1000_hw *hw) case E1000_DEV_ID_ICH8_IGP_AMT: case E1000_DEV_ID_ICH8_IGP_C: case E1000_DEV_ID_ICH8_IFE: + case E1000_DEV_ID_ICH8_IFE_GT: + case E1000_DEV_ID_ICH8_IFE_G: case E1000_DEV_ID_ICH8_IGP_M: hw->mac_type = e1000_ich8lan; break; @@ -2367,6 +2370,7 @@ e1000_phy_force_speed_duplex(struct e100 /* Need to reset the PHY or these changes will be ignored */ mii_ctrl_reg |= MII_CR_RESET; + /* Disable MDI-X support for 10/100 */ } else if (hw->phy_type == e1000_phy_ife) { ret_val = e1000_read_phy_reg(hw, IFE_PHY_MDIX_CONTROL, &phy_data); @@ -2379,6 +2383,7 @@ e1000_phy_force_speed_duplex(struct e100 ret_val = e1000_write_phy_reg(hw, IFE_PHY_MDIX_CONTROL, phy_data); if (ret_val) return ret_val; + } else { /* Clear Auto-Crossover to force MDI manually. IGP requires MDI * forced whenever speed or duplex are forced. @@ -3868,7 +3873,7 @@ e1000_phy_hw_reset(struct e1000_hw *hw) * * hw - Struct containing variables accessed by shared code * -* Sets bit 15 of the MII Control regiser +* Sets bit 15 of the MII Control register ******************************************************************************/ int32_t e1000_phy_reset(struct e1000_hw *hw) @@ -3940,14 +3945,15 @@ e1000_phy_powerdown_workaround(struct e1 E1000_WRITE_REG(hw, PHY_CTRL, reg | E1000_PHY_CTRL_GBE_DISABLE | E1000_PHY_CTRL_NOND0A_GBE_DISABLE); - /* Write VR power-down enable */ + /* Write VR power-down enable - bits 9:8 should be 10b */ e1000_read_phy_reg(hw, IGP3_VR_CTRL, &phy_data); - e1000_write_phy_reg(hw, IGP3_VR_CTRL, phy_data | - IGP3_VR_CTRL_MODE_SHUT); + phy_data |= (1 << 9); + phy_data &= ~(1 << 8); + e1000_write_phy_reg(hw, IGP3_VR_CTRL, phy_data); /* Read it back and test */ e1000_read_phy_reg(hw, IGP3_VR_CTRL, &phy_data); - if ((phy_data & IGP3_VR_CTRL_MODE_SHUT) || retry) + if (((phy_data & IGP3_VR_CTRL_MODE_MASK) == IGP3_VR_CTRL_MODE_SHUT) || retry) break; /* Issue PHY reset and repeat at most one more time */ @@ -4549,7 +4555,7 @@ e1000_init_eeprom_params(struct e1000_hw case e1000_ich8lan: { int32_t i = 0; - uint32_t flash_size = E1000_READ_ICH8_REG(hw, ICH8_FLASH_GFPREG); + uint32_t flash_size = E1000_READ_ICH_FLASH_REG(hw, ICH_FLASH_GFPREG); eeprom->type = e1000_eeprom_ich8; eeprom->use_eerd = FALSE; @@ -4565,12 +4571,14 @@ e1000_init_eeprom_params(struct e1000_hw } } - hw->flash_base_addr = (flash_size & ICH8_GFPREG_BASE_MASK) * - ICH8_FLASH_SECTOR_SIZE; + hw->flash_base_addr = (flash_size & ICH_GFPREG_BASE_MASK) * + ICH_FLASH_SECTOR_SIZE; + + hw->flash_bank_size = ((flash_size >> 16) & ICH_GFPREG_BASE_MASK) + 1; + hw->flash_bank_size -= (flash_size & ICH_GFPREG_BASE_MASK); + + hw->flash_bank_size *= ICH_FLASH_SECTOR_SIZE; - hw->flash_bank_size = ((flash_size >> 16) & ICH8_GFPREG_BASE_MASK) + 1; - hw->flash_bank_size -= (flash_size & ICH8_GFPREG_BASE_MASK); - hw->flash_bank_size *= ICH8_FLASH_SECTOR_SIZE; hw->flash_bank_size /= 2 * sizeof(uint16_t); break; @@ -5620,8 +5628,8 @@ e1000_commit_shadow_ram(struct e1000_hw * signature is valid. We want to do this after the write * has completed so that we don't mark the segment valid * while the write is still in progress */ - if (i == E1000_ICH8_NVM_SIG_WORD) - high_byte = E1000_ICH8_NVM_SIG_MASK | high_byte; + if (i == E1000_ICH_NVM_SIG_WORD) + high_byte = E1000_ICH_NVM_SIG_MASK | high_byte; error = e1000_verify_write_ich8_byte(hw, (i << 1) + new_bank_offset + 1, high_byte); @@ -5643,18 +5651,18 @@ e1000_commit_shadow_ram(struct e1000_hw * erase as well since these bits are 11 to start with * and we need to change bit 14 to 0b */ e1000_read_ich8_byte(hw, - E1000_ICH8_NVM_SIG_WORD * 2 + 1 + new_bank_offset, + E1000_ICH_NVM_SIG_WORD * 2 + 1 + new_bank_offset, &high_byte); high_byte &= 0xBF; error = e1000_verify_write_ich8_byte(hw, - E1000_ICH8_NVM_SIG_WORD * 2 + 1 + new_bank_offset, high_byte); + E1000_ICH_NVM_SIG_WORD * 2 + 1 + new_bank_offset, high_byte); /* And invalidate the previously valid segment by setting * its signature word (0x13) high_byte to 0b. This can be * done without an erase because flash erase sets all bits * to 1's. We can write 1's to 0's without an erase */ if (error == E1000_SUCCESS) { error = e1000_verify_write_ich8_byte(hw, - E1000_ICH8_NVM_SIG_WORD * 2 + 1 + old_bank_offset, 0); + E1000_ICH_NVM_SIG_WORD * 2 + 1 + old_bank_offset, 0); } /* Clear the now not used entry in the cache */ @@ -5841,6 +5849,7 @@ e1000_mta_set(struct e1000_hw *hw, hash_reg = (hash_value >> 5) & 0x7F; if (hw->mac_type == e1000_ich8lan) hash_reg &= 0x1F; + hash_bit = hash_value & 0x1F; mta = E1000_READ_REG_ARRAY(hw, MTA, hash_reg); @@ -6026,6 +6035,7 @@ e1000_id_led_init(struct e1000_hw * hw) else eeprom_data = ID_LED_DEFAULT; } + for (i = 0; i < 4; i++) { temp = (eeprom_data >> (i << 2)) & led_mask; switch (temp) { @@ -8486,7 +8496,7 @@ e1000_ich8_cycle_init(struct e1000_hw *h DEBUGFUNC("e1000_ich8_cycle_init"); - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); /* May be check the Flash Des Valid bit in Hw status */ if (hsfsts.hsf_status.fldesvalid == 0) { @@ -8499,7 +8509,7 @@ e1000_ich8_cycle_init(struct e1000_hw *h hsfsts.hsf_status.flcerr = 1; hsfsts.hsf_status.dael = 1; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFSTS, hsfsts.regval); + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS, hsfsts.regval); /* Either we should have a hardware SPI cycle in progress bit to check * against, in order to start a new cycle or FDONE bit should be changed @@ -8514,13 +8524,13 @@ e1000_ich8_cycle_init(struct e1000_hw *h /* There is no cycle running at present, so we can start a cycle */ /* Begin by setting Flash Cycle Done. */ hsfsts.hsf_status.flcdone = 1; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFSTS, hsfsts.regval); + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS, hsfsts.regval); error = E1000_SUCCESS; } else { /* otherwise poll for sometime so the current cycle has a chance * to end before giving up. */ - for (i = 0; i < ICH8_FLASH_COMMAND_TIMEOUT; i++) { - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + for (i = 0; i < ICH_FLASH_COMMAND_TIMEOUT; i++) { + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); if (hsfsts.hsf_status.flcinprog == 0) { error = E1000_SUCCESS; break; @@ -8531,7 +8541,7 @@ e1000_ich8_cycle_init(struct e1000_hw *h /* Successful in waiting for previous cycle to timeout, * now set the Flash Cycle Done. */ hsfsts.hsf_status.flcdone = 1; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFSTS, hsfsts.regval); + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS, hsfsts.regval); } else { DEBUGOUT("Flash controller busy, cannot get access"); } @@ -8553,13 +8563,13 @@ e1000_ich8_flash_cycle(struct e1000_hw * uint32_t i = 0; /* Start a cycle by writing 1 in Flash Cycle Go in Hw Flash Control */ - hsflctl.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFCTL); + hsflctl.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL); hsflctl.hsf_ctrl.flcgo = 1; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFCTL, hsflctl.regval); + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL, hsflctl.regval); /* wait till FDONE bit is set to 1 */ do { - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); if (hsfsts.hsf_status.flcdone == 1) break; udelay(1); @@ -8593,10 +8603,10 @@ e1000_read_ich8_data(struct e1000_hw *hw DEBUGFUNC("e1000_read_ich8_data"); if (size < 1 || size > 2 || data == 0x0 || - index > ICH8_FLASH_LINEAR_ADDR_MASK) + index > ICH_FLASH_LINEAR_ADDR_MASK) return error; - flash_linear_address = (ICH8_FLASH_LINEAR_ADDR_MASK & index) + + flash_linear_address = (ICH_FLASH_LINEAR_ADDR_MASK & index) + hw->flash_base_addr; do { @@ -8606,25 +8616,25 @@ e1000_read_ich8_data(struct e1000_hw *hw if (error != E1000_SUCCESS) break; - hsflctl.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFCTL); + hsflctl.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL); /* 0b/1b corresponds to 1 or 2 byte size, respectively. */ hsflctl.hsf_ctrl.fldbcount = size - 1; - hsflctl.hsf_ctrl.flcycle = ICH8_CYCLE_READ; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFCTL, hsflctl.regval); + hsflctl.hsf_ctrl.flcycle = ICH_CYCLE_READ; + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL, hsflctl.regval); /* Write the last 24 bits of index into Flash Linear address field in * Flash Address */ /* TODO: TBD maybe check the index against the size of flash */ - E1000_WRITE_ICH8_REG(hw, ICH8_FLASH_FADDR, flash_linear_address); + E1000_WRITE_ICH_FLASH_REG(hw, ICH_FLASH_FADDR, flash_linear_address); - error = e1000_ich8_flash_cycle(hw, ICH8_FLASH_COMMAND_TIMEOUT); + error = e1000_ich8_flash_cycle(hw, ICH_FLASH_COMMAND_TIMEOUT); /* Check if FCERR is set to 1, if set to 1, clear it and try the whole * sequence a few more times, else read in (shift in) the Flash Data0, * the order is least significant byte first msb to lsb */ if (error == E1000_SUCCESS) { - flash_data = E1000_READ_ICH8_REG(hw, ICH8_FLASH_FDATA0); + flash_data = E1000_READ_ICH_FLASH_REG(hw, ICH_FLASH_FDATA0); if (size == 1) { *data = (uint8_t)(flash_data & 0x000000FF); } else if (size == 2) { @@ -8634,9 +8644,9 @@ e1000_read_ich8_data(struct e1000_hw *hw } else { /* If we've gotten here, then things are probably completely hosed, * but if the error condition is detected, it won't hurt to give - * it another try...ICH8_FLASH_CYCLE_REPEAT_COUNT times. + * it another try...ICH_FLASH_CYCLE_REPEAT_COUNT times. */ - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); if (hsfsts.hsf_status.flcerr == 1) { /* Repeat for some time before giving up. */ continue; @@ -8645,7 +8655,7 @@ e1000_read_ich8_data(struct e1000_hw *hw break; } } - } while (count++ < ICH8_FLASH_CYCLE_REPEAT_COUNT); + } while (count++ < ICH_FLASH_CYCLE_REPEAT_COUNT); return error; } @@ -8672,10 +8682,10 @@ e1000_write_ich8_data(struct e1000_hw *h DEBUGFUNC("e1000_write_ich8_data"); if (size < 1 || size > 2 || data > size * 0xff || - index > ICH8_FLASH_LINEAR_ADDR_MASK) + index > ICH_FLASH_LINEAR_ADDR_MASK) return error; - flash_linear_address = (ICH8_FLASH_LINEAR_ADDR_MASK & index) + + flash_linear_address = (ICH_FLASH_LINEAR_ADDR_MASK & index) + hw->flash_base_addr; do { @@ -8685,34 +8695,34 @@ e1000_write_ich8_data(struct e1000_hw *h if (error != E1000_SUCCESS) break; - hsflctl.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFCTL); + hsflctl.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL); /* 0b/1b corresponds to 1 or 2 byte size, respectively. */ hsflctl.hsf_ctrl.fldbcount = size -1; - hsflctl.hsf_ctrl.flcycle = ICH8_CYCLE_WRITE; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFCTL, hsflctl.regval); + hsflctl.hsf_ctrl.flcycle = ICH_CYCLE_WRITE; + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL, hsflctl.regval); /* Write the last 24 bits of index into Flash Linear address field in * Flash Address */ - E1000_WRITE_ICH8_REG(hw, ICH8_FLASH_FADDR, flash_linear_address); + E1000_WRITE_ICH_FLASH_REG(hw, ICH_FLASH_FADDR, flash_linear_address); if (size == 1) flash_data = (uint32_t)data & 0x00FF; else flash_data = (uint32_t)data; - E1000_WRITE_ICH8_REG(hw, ICH8_FLASH_FDATA0, flash_data); + E1000_WRITE_ICH_FLASH_REG(hw, ICH_FLASH_FDATA0, flash_data); /* check if FCERR is set to 1 , if set to 1, clear it and try the whole * sequence a few more times else done */ - error = e1000_ich8_flash_cycle(hw, ICH8_FLASH_COMMAND_TIMEOUT); + error = e1000_ich8_flash_cycle(hw, ICH_FLASH_COMMAND_TIMEOUT); if (error == E1000_SUCCESS) { break; } else { /* If we're here, then things are most likely completely hosed, * but if the error condition is detected, it won't hurt to give - * it another try...ICH8_FLASH_CYCLE_REPEAT_COUNT times. + * it another try...ICH_FLASH_CYCLE_REPEAT_COUNT times. */ - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); if (hsfsts.hsf_status.flcerr == 1) { /* Repeat for some time before giving up. */ continue; @@ -8721,7 +8731,7 @@ e1000_write_ich8_data(struct e1000_hw *h break; } } - } while (count++ < ICH8_FLASH_CYCLE_REPEAT_COUNT); + } while (count++ < ICH_FLASH_CYCLE_REPEAT_COUNT); return error; } @@ -8840,7 +8850,7 @@ e1000_erase_ich8_4k_segment(struct e1000 int32_t j = 0; int32_t error_flag = 0; - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); /* Determine HW Sector size: Read BERASE bits of Hw flash Status register */ /* 00: The Hw sector is 256 bytes, hence we need to erase 16 @@ -8853,19 +8863,14 @@ e1000_erase_ich8_4k_segment(struct e1000 * 11: The Hw sector size is 64K bytes */ if (hsfsts.hsf_status.berasesz == 0x0) { /* Hw sector size 256 */ - sub_sector_size = ICH8_FLASH_SEG_SIZE_256; - bank_size = ICH8_FLASH_SECTOR_SIZE; - iteration = ICH8_FLASH_SECTOR_SIZE / ICH8_FLASH_SEG_SIZE_256; + sub_sector_size = ICH_FLASH_SEG_SIZE_256; + bank_size = ICH_FLASH_SECTOR_SIZE; + iteration = ICH_FLASH_SECTOR_SIZE / ICH_FLASH_SEG_SIZE_256; } else if (hsfsts.hsf_status.berasesz == 0x1) { - bank_size = ICH8_FLASH_SEG_SIZE_4K; - iteration = 1; - } else if (hw->mac_type != e1000_ich8lan && - hsfsts.hsf_status.berasesz == 0x2) { - /* 8K erase size invalid for ICH8 - added in for ICH9 */ - bank_size = ICH9_FLASH_SEG_SIZE_8K; + bank_size = ICH_FLASH_SEG_SIZE_4K; iteration = 1; } else if (hsfsts.hsf_status.berasesz == 0x3) { - bank_size = ICH8_FLASH_SEG_SIZE_64K; + bank_size = ICH_FLASH_SEG_SIZE_64K; iteration = 1; } else { return error; @@ -8883,9 +8888,9 @@ e1000_erase_ich8_4k_segment(struct e1000 /* Write a value 11 (block Erase) in Flash Cycle field in Hw flash * Control */ - hsflctl.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFCTL); - hsflctl.hsf_ctrl.flcycle = ICH8_CYCLE_ERASE; - E1000_WRITE_ICH8_REG16(hw, ICH8_FLASH_HSFCTL, hsflctl.regval); + hsflctl.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL); + hsflctl.hsf_ctrl.flcycle = ICH_CYCLE_ERASE; + E1000_WRITE_ICH_FLASH_REG16(hw, ICH_FLASH_HSFCTL, hsflctl.regval); /* Write the last 24 bits of an index within the block into Flash * Linear address field in Flash Address. This probably needs to @@ -8893,17 +8898,17 @@ e1000_erase_ich8_4k_segment(struct e1000 * the software bank size (4, 8 or 64 KBytes) */ flash_linear_address = bank * bank_size + j * sub_sector_size; flash_linear_address += hw->flash_base_addr; - flash_linear_address &= ICH8_FLASH_LINEAR_ADDR_MASK; + flash_linear_address &= ICH_FLASH_LINEAR_ADDR_MASK; - E1000_WRITE_ICH8_REG(hw, ICH8_FLASH_FADDR, flash_linear_address); + E1000_WRITE_ICH_FLASH_REG(hw, ICH_FLASH_FADDR, flash_linear_address); - error = e1000_ich8_flash_cycle(hw, ICH8_FLASH_ERASE_TIMEOUT); + error = e1000_ich8_flash_cycle(hw, ICH_FLASH_ERASE_TIMEOUT); /* Check if FCERR is set to 1. If 1, clear it and try the whole * sequence a few more times else Done */ if (error == E1000_SUCCESS) { break; } else { - hsfsts.regval = E1000_READ_ICH8_REG16(hw, ICH8_FLASH_HSFSTS); + hsfsts.regval = E1000_READ_ICH_FLASH_REG16(hw, ICH_FLASH_HSFSTS); if (hsfsts.hsf_status.flcerr == 1) { /* repeat for some time before giving up */ continue; @@ -8912,7 +8917,7 @@ e1000_erase_ich8_4k_segment(struct e1000 break; } } - } while ((count < ICH8_FLASH_CYCLE_REPEAT_COUNT) && !error_flag); + } while ((count < ICH_FLASH_CYCLE_REPEAT_COUNT) && !error_flag); if (error_flag == 1) break; } @@ -9013,5 +9018,3 @@ e1000_init_lcd_from_nvm(struct e1000_hw return E1000_SUCCESS; } - - Index: linux/drivers/net/e1000/e1000_hw.h =================================================================== --- linux.orig/drivers/net/e1000/e1000_hw.h +++ linux/drivers/net/e1000/e1000_hw.h @@ -128,11 +128,13 @@ typedef enum { /* PCI bus widths */ typedef enum { e1000_bus_width_unknown = 0, + /* These PCIe values should literally match the possible return values + * from config space */ + e1000_bus_width_pciex_1 = 1, + e1000_bus_width_pciex_2 = 2, + e1000_bus_width_pciex_4 = 4, e1000_bus_width_32, e1000_bus_width_64, - e1000_bus_width_pciex_1, - e1000_bus_width_pciex_2, - e1000_bus_width_pciex_4, e1000_bus_width_reserved } e1000_bus_width; @@ -326,6 +328,7 @@ int32_t e1000_phy_hw_reset(struct e1000_ int32_t e1000_phy_reset(struct e1000_hw *hw); int32_t e1000_phy_get_info(struct e1000_hw *hw, struct e1000_phy_info *phy_info); int32_t e1000_validate_mdi_setting(struct e1000_hw *hw); + void e1000_phy_powerdown_workaround(struct e1000_hw *hw); /* EEPROM Functions */ @@ -390,7 +393,6 @@ int32_t e1000_mng_write_dhcp_info(struct uint16_t length); boolean_t e1000_check_mng_mode(struct e1000_hw *hw); boolean_t e1000_enable_tx_pkt_filtering(struct e1000_hw *hw); - int32_t e1000_read_eeprom(struct e1000_hw *hw, uint16_t reg, uint16_t words, uint16_t *data); int32_t e1000_validate_eeprom_checksum(struct e1000_hw *hw); int32_t e1000_update_eeprom_checksum(struct e1000_hw *hw); @@ -473,6 +475,7 @@ int32_t e1000_check_phy_reset_block(stru #define E1000_DEV_ID_82571EB_FIBER 0x105F #define E1000_DEV_ID_82571EB_SERDES 0x1060 #define E1000_DEV_ID_82571EB_QUAD_COPPER 0x10A4 +#define E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE 0x10BC #define E1000_DEV_ID_82572EI_COPPER 0x107D #define E1000_DEV_ID_82572EI_FIBER 0x107E #define E1000_DEV_ID_82572EI_SERDES 0x107F @@ -490,6 +493,8 @@ int32_t e1000_check_phy_reset_block(stru #define E1000_DEV_ID_ICH8_IGP_AMT 0x104A #define E1000_DEV_ID_ICH8_IGP_C 0x104B #define E1000_DEV_ID_ICH8_IFE 0x104C +#define E1000_DEV_ID_ICH8_IFE_GT 0x10C4 +#define E1000_DEV_ID_ICH8_IFE_G 0x10C5 #define E1000_DEV_ID_ICH8_IGP_M 0x104D @@ -576,6 +581,7 @@ int32_t e1000_check_phy_reset_block(stru * E1000_RAR_ENTRIES - 1 multicast addresses. */ #define E1000_RAR_ENTRIES 15 + #define E1000_RAR_ENTRIES_ICH8LAN 6 #define MIN_NUMBER_OF_DESCRIPTORS 8 @@ -1335,9 +1341,9 @@ struct e1000_hw_stats { uint64_t gotch; uint64_t rnbc; uint64_t ruc; + uint64_t rfc; uint64_t roc; uint64_t rlerrc; - uint64_t rfc; uint64_t rjc; uint64_t mgprc; uint64_t mgpdc; @@ -1577,8 +1583,8 @@ struct e1000_hw { #define E1000_HICR_FW_RESET 0xC0 #define E1000_SHADOW_RAM_WORDS 2048 -#define E1000_ICH8_NVM_SIG_WORD 0x13 -#define E1000_ICH8_NVM_SIG_MASK 0xC0 +#define E1000_ICH_NVM_SIG_WORD 0x13 +#define E1000_ICH_NVM_SIG_MASK 0xC0 /* EEPROM Read */ #define E1000_EERD_START 0x00000001 /* Start Read */ @@ -3172,6 +3178,7 @@ struct e1000_host_command_info { #define IGP3_VR_CTRL \ PHY_REG(776, 18) /* Voltage regulator control register */ #define IGP3_VR_CTRL_MODE_SHUT 0x0200 /* Enter powerdown, shutdown VRs */ +#define IGP3_VR_CTRL_MODE_MASK 0x0300 /* Shutdown VR Mask */ #define IGP3_CAPABILITY \ PHY_REG(776, 19) /* IGP3 Capability Register */ @@ -3256,41 +3263,40 @@ struct e1000_host_command_info { #define IFE_PSCL_PROBE_LEDS_OFF 0x0006 /* Force LEDs 0 and 2 off */ #define IFE_PSCL_PROBE_LEDS_ON 0x0007 /* Force LEDs 0 and 2 on */ -#define ICH8_FLASH_COMMAND_TIMEOUT 5000 /* 5000 uSecs - adjusted */ -#define ICH8_FLASH_ERASE_TIMEOUT 3000000 /* Up to 3 seconds - worst case */ -#define ICH8_FLASH_CYCLE_REPEAT_COUNT 10 /* 10 cycles */ -#define ICH8_FLASH_SEG_SIZE_256 256 -#define ICH8_FLASH_SEG_SIZE_4K 4096 -#define ICH9_FLASH_SEG_SIZE_8K 8192 -#define ICH8_FLASH_SEG_SIZE_64K 65536 - -#define ICH8_CYCLE_READ 0x0 -#define ICH8_CYCLE_RESERVED 0x1 -#define ICH8_CYCLE_WRITE 0x2 -#define ICH8_CYCLE_ERASE 0x3 - -#define ICH8_FLASH_GFPREG 0x0000 -#define ICH8_FLASH_HSFSTS 0x0004 -#define ICH8_FLASH_HSFCTL 0x0006 -#define ICH8_FLASH_FADDR 0x0008 -#define ICH8_FLASH_FDATA0 0x0010 -#define ICH8_FLASH_FRACC 0x0050 -#define ICH8_FLASH_FREG0 0x0054 -#define ICH8_FLASH_FREG1 0x0058 -#define ICH8_FLASH_FREG2 0x005C -#define ICH8_FLASH_FREG3 0x0060 -#define ICH8_FLASH_FPR0 0x0074 -#define ICH8_FLASH_FPR1 0x0078 -#define ICH8_FLASH_SSFSTS 0x0090 -#define ICH8_FLASH_SSFCTL 0x0092 -#define ICH8_FLASH_PREOP 0x0094 -#define ICH8_FLASH_OPTYPE 0x0096 -#define ICH8_FLASH_OPMENU 0x0098 - -#define ICH8_FLASH_REG_MAPSIZE 0x00A0 -#define ICH8_FLASH_SECTOR_SIZE 4096 -#define ICH8_GFPREG_BASE_MASK 0x1FFF -#define ICH8_FLASH_LINEAR_ADDR_MASK 0x00FFFFFF +#define ICH_FLASH_COMMAND_TIMEOUT 5000 /* 5000 uSecs - adjusted */ +#define ICH_FLASH_ERASE_TIMEOUT 3000000 /* Up to 3 seconds - worst case */ +#define ICH_FLASH_CYCLE_REPEAT_COUNT 10 /* 10 cycles */ +#define ICH_FLASH_SEG_SIZE_256 256 +#define ICH_FLASH_SEG_SIZE_4K 4096 +#define ICH_FLASH_SEG_SIZE_64K 65536 + +#define ICH_CYCLE_READ 0x0 +#define ICH_CYCLE_RESERVED 0x1 +#define ICH_CYCLE_WRITE 0x2 +#define ICH_CYCLE_ERASE 0x3 + +#define ICH_FLASH_GFPREG 0x0000 +#define ICH_FLASH_HSFSTS 0x0004 +#define ICH_FLASH_HSFCTL 0x0006 +#define ICH_FLASH_FADDR 0x0008 +#define ICH_FLASH_FDATA0 0x0010 +#define ICH_FLASH_FRACC 0x0050 +#define ICH_FLASH_FREG0 0x0054 +#define ICH_FLASH_FREG1 0x0058 +#define ICH_FLASH_FREG2 0x005C +#define ICH_FLASH_FREG3 0x0060 +#define ICH_FLASH_FPR0 0x0074 +#define ICH_FLASH_FPR1 0x0078 +#define ICH_FLASH_SSFSTS 0x0090 +#define ICH_FLASH_SSFCTL 0x0092 +#define ICH_FLASH_PREOP 0x0094 +#define ICH_FLASH_OPTYPE 0x0096 +#define ICH_FLASH_OPMENU 0x0098 + +#define ICH_FLASH_REG_MAPSIZE 0x00A0 +#define ICH_FLASH_SECTOR_SIZE 4096 +#define ICH_GFPREG_BASE_MASK 0x1FFF +#define ICH_FLASH_LINEAR_ADDR_MASK 0x00FFFFFF /* ICH8 GbE Flash Hardware Sequencing Flash Status Register bit breakdown */ /* Offset 04h HSFSTS */ Index: linux/drivers/net/e1000/e1000_main.c =================================================================== --- linux.orig/drivers/net/e1000/e1000_main.c +++ linux/drivers/net/e1000/e1000_main.c @@ -27,6 +27,7 @@ *******************************************************************************/ #include "e1000.h" +#include char e1000_driver_name[] = "e1000"; static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver"; @@ -35,7 +36,7 @@ static char e1000_driver_string[] = "Int #else #define DRIVERNAPI "-NAPI" #endif -#define DRV_VERSION "7.2.9-k4"DRIVERNAPI +#define DRV_VERSION "7.3.15-k2"DRIVERNAPI char e1000_driver_version[] = DRV_VERSION; static char e1000_copyright[] = "Copyright (c) 1999-2006 Intel Corporation."; @@ -103,6 +104,9 @@ static struct pci_device_id e1000_pci_tb INTEL_E1000_ETHERNET_DEVICE(0x10B9), INTEL_E1000_ETHERNET_DEVICE(0x10BA), INTEL_E1000_ETHERNET_DEVICE(0x10BB), + INTEL_E1000_ETHERNET_DEVICE(0x10BC), + INTEL_E1000_ETHERNET_DEVICE(0x10C4), + INTEL_E1000_ETHERNET_DEVICE(0x10C5), /* required last entry */ {0,} }; @@ -154,6 +158,9 @@ static struct net_device_stats * e1000_g static int e1000_change_mtu(struct net_device *netdev, int new_mtu); static int e1000_set_mac(struct net_device *netdev, void *p); static irqreturn_t e1000_intr(int irq, void *data); +#ifdef CONFIG_PCI_MSI +static irqreturn_t e1000_intr_msi(int irq, void *data); +#endif static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring); #ifdef CONFIG_E1000_NAPI @@ -285,7 +292,7 @@ static int e1000_request_irq(struct e100 flags = IRQF_SHARED; #ifdef CONFIG_PCI_MSI - if (adapter->hw.mac_type > e1000_82547_rev_2) { + if (adapter->hw.mac_type >= e1000_82571) { adapter->have_msi = TRUE; if ((err = pci_enable_msi(adapter->pdev))) { DPRINTK(PROBE, ERR, @@ -293,8 +300,14 @@ static int e1000_request_irq(struct e100 adapter->have_msi = FALSE; } } - if (adapter->have_msi) + if (adapter->have_msi) { flags &= ~IRQF_SHARED; + err = request_irq(adapter->pdev->irq, &e1000_intr_msi, flags, + netdev->name, netdev); + if (err) + DPRINTK(PROBE, ERR, + "Unable to allocate interrupt Error: %d\n", err); + } else #endif if ((err = request_irq(adapter->pdev->irq, &e1000_intr, flags, netdev->name, netdev))) @@ -375,7 +388,7 @@ e1000_update_mng_vlan(struct e1000_adapt * e1000_release_hw_control resets {CTRL_EXT|FWSM}:DRV_LOAD bit. * For ASF and Pass Through versions of f/w this means that the * driver is no longer loaded. For AMT version (only with 82573) i - * of the f/w this means that the netowrk i/f is closed. + * of the f/w this means that the network i/f is closed. * **/ @@ -416,7 +429,7 @@ e1000_release_hw_control(struct e1000_ad * e1000_get_hw_control sets {CTRL_EXT|FWSM}:DRV_LOAD bit. * For ASF and Pass Through versions of f/w this means that * the driver is loaded. For AMT version (only with 82573) - * of the f/w this means that the netowrk i/f is open. + * of the f/w this means that the network i/f is open. * **/ @@ -426,6 +439,7 @@ e1000_get_hw_control(struct e1000_adapte uint32_t ctrl_ext; uint32_t swsm; uint32_t extcnf; + /* Let firmware know the driver has taken over */ switch (adapter->hw.mac_type) { case e1000_82571: @@ -483,7 +497,7 @@ e1000_up(struct e1000_adapter *adapter) clear_bit(__E1000_DOWN, &adapter->flags); - mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ); + mod_timer(&adapter->watchdog_timer, round_jiffies(jiffies + 2 * HZ)); return 0; } @@ -601,9 +615,6 @@ void e1000_reset(struct e1000_adapter *adapter) { uint32_t pba, manc; -#ifdef DISABLE_MULR - uint32_t tctl; -#endif uint16_t fc_high_water_mark = E1000_FC_HIGH_DIFF; /* Repartition Pba for greater than 9k mtu @@ -670,12 +681,7 @@ e1000_reset(struct e1000_adapter *adapte e1000_reset_hw(&adapter->hw); if (adapter->hw.mac_type >= e1000_82544) E1000_WRITE_REG(&adapter->hw, WUC, 0); -#ifdef DISABLE_MULR - /* disable Multiple Reads in Transmit Control Register for debugging */ - tctl = E1000_READ_REG(hw, TCTL); - E1000_WRITE_REG(hw, TCTL, tctl & ~E1000_TCTL_MULR); -#endif if (e1000_init_hw(&adapter->hw)) DPRINTK(PROBE, ERR, "Hardware Error\n"); e1000_update_mng_vlan(adapter); @@ -851,9 +857,9 @@ e1000_probe(struct pci_dev *pdev, (adapter->hw.mac_type != e1000_82547)) netdev->features |= NETIF_F_TSO; -#ifdef NETIF_F_TSO_IPV6 +#ifdef NETIF_F_TSO6 if (adapter->hw.mac_type > e1000_82547_rev_2) - netdev->features |= NETIF_F_TSO_IPV6; + netdev->features |= NETIF_F_TSO6; #endif #endif if (pci_using_dac) @@ -968,6 +974,7 @@ e1000_probe(struct pci_dev *pdev, break; case E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3: case E1000_DEV_ID_82571EB_QUAD_COPPER: + case E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE: /* if quad port adapter, disable WoL on all but port A */ if (global_quad_port_a != 0) adapter->eeprom_wol = 0; @@ -1279,12 +1286,10 @@ e1000_open(struct net_device *netdev) return -EBUSY; /* allocate transmit descriptors */ - if ((err = e1000_setup_all_tx_resources(adapter))) goto err_setup_tx; /* allocate receive descriptors */ - if ((err = e1000_setup_all_rx_resources(adapter))) goto err_setup_rx; @@ -1569,6 +1574,8 @@ e1000_configure_tx(struct e1000_adapter if (hw->mac_type == e1000_82571 || hw->mac_type == e1000_82572) { tarc = E1000_READ_REG(hw, TARC0); + /* set the speed mode bit, we'll clear it if we're not at + * gigabit link later */ tarc |= (1 << 21); E1000_WRITE_REG(hw, TARC0, tarc); } else if (hw->mac_type == e1000_80003es2lan) { @@ -1583,8 +1590,11 @@ e1000_configure_tx(struct e1000_adapter e1000_config_collision_dist(hw); /* Setup Transmit Descriptor Settings for eop descriptor */ - adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP | - E1000_TXD_CMD_IFCS; + adapter->txd_cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_IFCS; + + /* only set IDE if we are delaying interrupts using the timers */ + if (adapter->tx_int_delay) + adapter->txd_cmd |= E1000_TXD_CMD_IDE; if (hw->mac_type < e1000_82543) adapter->txd_cmd |= E1000_TXD_CMD_RPS; @@ -1821,8 +1831,11 @@ e1000_setup_rctl(struct e1000_adapter *a /* Configure extra packet-split registers */ rfctl = E1000_READ_REG(&adapter->hw, RFCTL); rfctl |= E1000_RFCTL_EXTEN; - /* disable IPv6 packet split support */ - rfctl |= E1000_RFCTL_IPV6_DIS; + /* disable packet split support for IPv6 extension headers, + * because some malformed IPv6 headers can hang the RX */ + rfctl |= (E1000_RFCTL_IPV6_EX_DIS | + E1000_RFCTL_NEW_IPV6_EXT_DIS); + E1000_WRITE_REG(&adapter->hw, RFCTL, rfctl); rctl |= E1000_RCTL_DTYP_PS; @@ -1885,7 +1898,7 @@ e1000_configure_rx(struct e1000_adapter if (hw->mac_type >= e1000_82540) { E1000_WRITE_REG(hw, RADV, adapter->rx_abs_int_delay); - if (adapter->itr > 1) + if (adapter->itr_setting != 0) E1000_WRITE_REG(hw, ITR, 1000000000 / (adapter->itr * 256)); } @@ -1895,11 +1908,11 @@ e1000_configure_rx(struct e1000_adapter /* Reset delay timers after every interrupt */ ctrl_ext |= E1000_CTRL_EXT_INT_TIMER_CLR; #ifdef CONFIG_E1000_NAPI - /* Auto-Mask interrupts upon ICR read. */ + /* Auto-Mask interrupts upon ICR access */ ctrl_ext |= E1000_CTRL_EXT_IAME; + E1000_WRITE_REG(hw, IAM, 0xffffffff); #endif E1000_WRITE_REG(hw, CTRL_EXT, ctrl_ext); - E1000_WRITE_REG(hw, IAM, ~0); E1000_WRITE_FLUSH(hw); } @@ -1938,6 +1951,12 @@ e1000_configure_rx(struct e1000_adapter E1000_WRITE_REG(hw, RXCSUM, rxcsum); } + /* enable early receives on 82573, only takes effect if using > 2048 + * byte total frame size. for example only for jumbo frames */ +#define E1000_ERT_2048 0x100 + if (hw->mac_type == e1000_82573) + E1000_WRITE_REG(hw, ERT, E1000_ERT_2048); + /* Enable Receives */ E1000_WRITE_REG(hw, RCTL, rctl); } @@ -1991,10 +2010,13 @@ e1000_unmap_and_free_tx_resource(struct buffer_info->dma, buffer_info->length, PCI_DMA_TODEVICE); + buffer_info->dma = 0; } - if (buffer_info->skb) + if (buffer_info->skb) { dev_kfree_skb_any(buffer_info->skb); - memset(buffer_info, 0, sizeof(struct e1000_buffer)); + buffer_info->skb = NULL; + } + /* buffer_info must be completely set up in the transmit path */ } /** @@ -2418,6 +2440,7 @@ e1000_watchdog(unsigned long data) DPRINTK(LINK, INFO, "Gigabit has been disabled, downgrading speed\n"); } + if (adapter->hw.mac_type == e1000_82573) { e1000_enable_tx_pkt_filtering(&adapter->hw); if (adapter->mng_vlan_id != adapter->hw.mng_cookie.vlan_id) @@ -2462,13 +2485,12 @@ e1000_watchdog(unsigned long data) if ((adapter->hw.mac_type == e1000_82571 || adapter->hw.mac_type == e1000_82572) && txb2b == 0) { -#define SPEED_MODE_BIT (1 << 21) uint32_t tarc0; tarc0 = E1000_READ_REG(&adapter->hw, TARC0); - tarc0 &= ~SPEED_MODE_BIT; + tarc0 &= ~(1 << 21); E1000_WRITE_REG(&adapter->hw, TARC0, tarc0); } - + #ifdef NETIF_F_TSO /* disable TSO for pcie and 10/100 speeds, to avoid * some hardware issues */ @@ -2480,9 +2502,15 @@ e1000_watchdog(unsigned long data) DPRINTK(PROBE,INFO, "10/100 speed: disabling TSO\n"); netdev->features &= ~NETIF_F_TSO; +#ifdef NETIF_F_TSO6 + netdev->features &= ~NETIF_F_TSO6; +#endif break; case SPEED_1000: netdev->features |= NETIF_F_TSO; +#ifdef NETIF_F_TSO6 + netdev->features |= NETIF_F_TSO6; +#endif break; default: /* oops */ @@ -2499,7 +2527,7 @@ e1000_watchdog(unsigned long data) netif_carrier_on(netdev); netif_wake_queue(netdev); - mod_timer(&adapter->phy_info_timer, jiffies + 2 * HZ); + mod_timer(&adapter->phy_info_timer, round_jiffies(jiffies + 2 * HZ)); adapter->smartspeed = 0; } } else { @@ -2509,7 +2537,7 @@ e1000_watchdog(unsigned long data) DPRINTK(LINK, INFO, "NIC Link is Down\n"); netif_carrier_off(netdev); netif_stop_queue(netdev); - mod_timer(&adapter->phy_info_timer, jiffies + 2 * HZ); + mod_timer(&adapter->phy_info_timer, round_jiffies(jiffies + 2 * HZ)); /* 80003ES2LAN workaround-- * For packet buffer work-around on link down event; @@ -2549,19 +2577,6 @@ e1000_watchdog(unsigned long data) } } - /* Dynamic mode for Interrupt Throttle Rate (ITR) */ - if (adapter->hw.mac_type >= e1000_82540 && adapter->itr == 1) { - /* Symmetric Tx/Rx gets a reduced ITR=2000; Total - * asymmetrical Tx or Rx gets ITR=8000; everyone - * else is between 2000-8000. */ - uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000; - uint32_t dif = (adapter->gotcl > adapter->gorcl ? - adapter->gotcl - adapter->gorcl : - adapter->gorcl - adapter->gotcl) / 10000; - uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000; - E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256)); - } - /* Cause software interrupt to ensure rx ring is cleaned */ E1000_WRITE_REG(&adapter->hw, ICS, E1000_ICS_RXDMT0); @@ -2574,7 +2589,136 @@ e1000_watchdog(unsigned long data) e1000_rar_set(&adapter->hw, adapter->hw.mac_addr, 0); /* Reset the timer */ - mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ); + mod_timer(&adapter->watchdog_timer, round_jiffies(jiffies + 2 * HZ)); +} + +enum latency_range { + lowest_latency = 0, + low_latency = 1, + bulk_latency = 2, + latency_invalid = 255 +}; + +/** + * e1000_update_itr - update the dynamic ITR value based on statistics + * Stores a new ITR value based on packets and byte + * counts during the last interrupt. The advantage of per interrupt + * computation is faster updates and more accurate ITR for the current + * traffic pattern. Constants in this function were computed + * based on theoretical maximum wire speed and thresholds were set based + * on testing data as well as attempting to minimize response time + * while increasing bulk throughput. + * this functionality is controlled by the InterruptThrottleRate module + * parameter (see e1000_param.c) + * @adapter: pointer to adapter + * @itr_setting: current adapter->itr + * @packets: the number of packets during this measurement interval + * @bytes: the number of bytes during this measurement interval + **/ +static unsigned int e1000_update_itr(struct e1000_adapter *adapter, + uint16_t itr_setting, + int packets, + int bytes) +{ + unsigned int retval = itr_setting; + struct e1000_hw *hw = &adapter->hw; + + if (unlikely(hw->mac_type < e1000_82540)) + goto update_itr_done; + + if (packets == 0) + goto update_itr_done; + + + switch (itr_setting) { + case lowest_latency: + if ((packets < 5) && (bytes > 512)) + retval = low_latency; + break; + case low_latency: /* 50 usec aka 20000 ints/s */ + if (bytes > 10000) { + if ((packets < 10) || + ((bytes/packets) > 1200)) + retval = bulk_latency; + else if ((packets > 35)) + retval = lowest_latency; + } else if (packets <= 2 && bytes < 512) + retval = lowest_latency; + break; + case bulk_latency: /* 250 usec aka 4000 ints/s */ + if (bytes > 25000) { + if (packets > 35) + retval = low_latency; + } else { + if (bytes < 6000) + retval = low_latency; + } + break; + } + +update_itr_done: + return retval; +} + +static void e1000_set_itr(struct e1000_adapter *adapter) +{ + struct e1000_hw *hw = &adapter->hw; + uint16_t current_itr; + uint32_t new_itr = adapter->itr; + + if (unlikely(hw->mac_type < e1000_82540)) + return; + + /* for non-gigabit speeds, just fix the interrupt rate at 4000 */ + if (unlikely(adapter->link_speed != SPEED_1000)) { + current_itr = 0; + new_itr = 4000; + goto set_itr_now; + } + + adapter->tx_itr = e1000_update_itr(adapter, + adapter->tx_itr, + adapter->total_tx_packets, + adapter->total_tx_bytes); + adapter->rx_itr = e1000_update_itr(adapter, + adapter->rx_itr, + adapter->total_rx_packets, + adapter->total_rx_bytes); + + current_itr = max(adapter->rx_itr, adapter->tx_itr); + + /* conservative mode eliminates the lowest_latency setting */ + if (current_itr == lowest_latency && (adapter->itr_setting == 3)) + current_itr = low_latency; + + switch (current_itr) { + /* counts and packets in update_itr are dependent on these numbers */ + case lowest_latency: + new_itr = 70000; + break; + case low_latency: + new_itr = 20000; /* aka hwitr = ~200 */ + break; + case bulk_latency: + new_itr = 4000; + break; + default: + break; + } + +set_itr_now: + if (new_itr != adapter->itr) { + /* this attempts to bias the interrupt rate towards Bulk + * by adding intermediate steps when interrupt rate is + * increasing */ + new_itr = new_itr > adapter->itr ? + min(adapter->itr + (new_itr >> 2), new_itr) : + new_itr; + adapter->itr = new_itr; + E1000_WRITE_REG(hw, ITR, 1000000000 / (new_itr * 256)); + } + + return; } #define E1000_TX_FLAGS_CSUM 0x00000001 @@ -2617,7 +2761,7 @@ e1000_tso(struct e1000_adapter *adapter, 0); cmd_length = E1000_TXD_CMD_IP; ipcse = skb->h.raw - skb->data - 1; -#ifdef NETIF_F_TSO_IPV6 +#ifdef NETIF_F_TSO6 } else if (skb->protocol == htons(ETH_P_IPV6)) { skb->nh.ipv6h->payload_len = 0; skb->h.th->check = @@ -2653,6 +2797,7 @@ e1000_tso(struct e1000_adapter *adapter, context_desc->cmd_and_length = cpu_to_le32(cmd_length); buffer_info->time_stamp = jiffies; + buffer_info->next_to_watch = i; if (++i == tx_ring->count) i = 0; tx_ring->next_to_use = i; @@ -2687,6 +2832,7 @@ e1000_tx_csum(struct e1000_adapter *adap context_desc->cmd_and_length = cpu_to_le32(E1000_TXD_CMD_DEXT); buffer_info->time_stamp = jiffies; + buffer_info->next_to_watch = i; if (unlikely(++i == tx_ring->count)) i = 0; tx_ring->next_to_use = i; @@ -2755,6 +2901,7 @@ e1000_tx_map(struct e1000_adapter *adapt size, PCI_DMA_TODEVICE); buffer_info->time_stamp = jiffies; + buffer_info->next_to_watch = i; len -= size; offset += size; @@ -2794,6 +2941,7 @@ e1000_tx_map(struct e1000_adapter *adapt size, PCI_DMA_TODEVICE); buffer_info->time_stamp = jiffies; + buffer_info->next_to_watch = i; len -= size; offset += size; @@ -2859,6 +3007,9 @@ e1000_tx_queue(struct e1000_adapter *ada tx_ring->next_to_use = i; writel(i, adapter->hw.hw_addr + tx_ring->tdt); + /* we need this if more than one processor can write to our tail + * at a time, it syncronizes IO on IA64/Altix systems */ + mmiowb(); } /** @@ -2952,6 +3103,7 @@ static int __e1000_maybe_stop_tx(struct /* A reprieve! */ netif_start_queue(netdev); + ++adapter->restart_queue; return 0; } @@ -3010,9 +3162,9 @@ e1000_xmit_frame(struct sk_buff *skb, st max_per_txd = min(mss << 2, max_per_txd); max_txd_pwr = fls(max_per_txd) - 1; - /* TSO Workaround for 82571/2/3 Controllers -- if skb->data - * points to just header, pull a few bytes of payload from - * frags into skb->data */ + /* TSO Workaround for 82571/2/3 Controllers -- if skb->data + * points to just header, pull a few bytes of payload from + * frags into skb->data */ hdr_len = ((skb->h.raw - skb->data) + (skb->h.th->doff << 2)); if (skb->data_len && (hdr_len == (skb->len - skb->data_len))) { switch (adapter->hw.mac_type) { @@ -3076,10 +3228,8 @@ e1000_xmit_frame(struct sk_buff *skb, st (adapter->hw.mac_type == e1000_82573)) e1000_transfer_dhcp_info(adapter, skb); - local_irq_save(flags); - if (!spin_trylock(&tx_ring->tx_lock)) { + if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) { /* Collision - tell upper layer to requeue */ - local_irq_restore(flags); return NETDEV_TX_LOCKED; } @@ -3316,12 +3466,12 @@ e1000_update_stats(struct e1000_adapter adapter->stats.roc += E1000_READ_REG(hw, ROC); if (adapter->hw.mac_type != e1000_ich8lan) { - adapter->stats.prc64 += E1000_READ_REG(hw, PRC64); - adapter->stats.prc127 += E1000_READ_REG(hw, PRC127); - adapter->stats.prc255 += E1000_READ_REG(hw, PRC255); - adapter->stats.prc511 += E1000_READ_REG(hw, PRC511); - adapter->stats.prc1023 += E1000_READ_REG(hw, PRC1023); - adapter->stats.prc1522 += E1000_READ_REG(hw, PRC1522); + adapter->stats.prc64 += E1000_READ_REG(hw, PRC64); + adapter->stats.prc127 += E1000_READ_REG(hw, PRC127); + adapter->stats.prc255 += E1000_READ_REG(hw, PRC255); + adapter->stats.prc511 += E1000_READ_REG(hw, PRC511); + adapter->stats.prc1023 += E1000_READ_REG(hw, PRC1023); + adapter->stats.prc1522 += E1000_READ_REG(hw, PRC1522); } adapter->stats.symerrs += E1000_READ_REG(hw, SYMERRS); @@ -3352,12 +3502,12 @@ e1000_update_stats(struct e1000_adapter adapter->stats.tpr += E1000_READ_REG(hw, TPR); if (adapter->hw.mac_type != e1000_ich8lan) { - adapter->stats.ptc64 += E1000_READ_REG(hw, PTC64); - adapter->stats.ptc127 += E1000_READ_REG(hw, PTC127); - adapter->stats.ptc255 += E1000_READ_REG(hw, PTC255); - adapter->stats.ptc511 += E1000_READ_REG(hw, PTC511); - adapter->stats.ptc1023 += E1000_READ_REG(hw, PTC1023); - adapter->stats.ptc1522 += E1000_READ_REG(hw, PTC1522); + adapter->stats.ptc64 += E1000_READ_REG(hw, PTC64); + adapter->stats.ptc127 += E1000_READ_REG(hw, PTC127); + adapter->stats.ptc255 += E1000_READ_REG(hw, PTC255); + adapter->stats.ptc511 += E1000_READ_REG(hw, PTC511); + adapter->stats.ptc1023 += E1000_READ_REG(hw, PTC1023); + adapter->stats.ptc1522 += E1000_READ_REG(hw, PTC1522); } adapter->stats.mptc += E1000_READ_REG(hw, MPTC); @@ -3383,18 +3533,17 @@ e1000_update_stats(struct e1000_adapter adapter->stats.icrxoc += E1000_READ_REG(hw, ICRXOC); if (adapter->hw.mac_type != e1000_ich8lan) { - adapter->stats.icrxptc += E1000_READ_REG(hw, ICRXPTC); - adapter->stats.icrxatc += E1000_READ_REG(hw, ICRXATC); - adapter->stats.ictxptc += E1000_READ_REG(hw, ICTXPTC); - adapter->stats.ictxatc += E1000_READ_REG(hw, ICTXATC); - adapter->stats.ictxqec += E1000_READ_REG(hw, ICTXQEC); - adapter->stats.ictxqmtc += E1000_READ_REG(hw, ICTXQMTC); - adapter->stats.icrxdmtc += E1000_READ_REG(hw, ICRXDMTC); + adapter->stats.icrxptc += E1000_READ_REG(hw, ICRXPTC); + adapter->stats.icrxatc += E1000_READ_REG(hw, ICRXATC); + adapter->stats.ictxptc += E1000_READ_REG(hw, ICTXPTC); + adapter->stats.ictxatc += E1000_READ_REG(hw, ICTXATC); + adapter->stats.ictxqec += E1000_READ_REG(hw, ICTXQEC); + adapter->stats.ictxqmtc += E1000_READ_REG(hw, ICTXQMTC); + adapter->stats.icrxdmtc += E1000_READ_REG(hw, ICRXDMTC); } } /* Fill out the OS statistics structure */ - adapter->net_stats.rx_packets = adapter->stats.gprc; adapter->net_stats.tx_packets = adapter->stats.gptc; adapter->net_stats.rx_bytes = adapter->stats.gorcl; @@ -3426,7 +3575,6 @@ e1000_update_stats(struct e1000_adapter /* Tx Dropped needs to be maintained elsewhere */ /* Phy Stats */ - if (hw->media_type == e1000_media_type_copper) { if ((adapter->link_speed == SPEED_1000) && (!e1000_read_phy_reg(hw, PHY_1000T_STATUS, &phy_tmp))) { @@ -3442,6 +3590,95 @@ e1000_update_stats(struct e1000_adapter spin_unlock_irqrestore(&adapter->stats_lock, flags); } +#ifdef CONFIG_PCI_MSI + +/** + * e1000_intr_msi - Interrupt Handler + * @irq: interrupt number + * @data: pointer to a network interface device structure + **/ + +static +irqreturn_t e1000_intr_msi(int irq, void *data) +{ + struct net_device *netdev = data; + struct e1000_adapter *adapter = netdev_priv(netdev); + struct e1000_hw *hw = &adapter->hw; +#ifndef CONFIG_E1000_NAPI + int i; +#endif + + /* this code avoids the read of ICR but has to get 1000 interrupts + * at every link change event before it will notice the change */ + if (++adapter->detect_link >= 1000) { + uint32_t icr = E1000_READ_REG(hw, ICR); +#ifdef CONFIG_E1000_NAPI + /* read ICR disables interrupts using IAM, so keep up with our + * enable/disable accounting */ + atomic_inc(&adapter->irq_sem); +#endif + adapter->detect_link = 0; + if ((icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC)) && + (icr & E1000_ICR_INT_ASSERTED)) { + hw->get_link_status = 1; + /* 80003ES2LAN workaround-- + * For packet buffer work-around on link down event; + * disable receives here in the ISR and + * reset adapter in watchdog + */ + if (netif_carrier_ok(netdev) && + (adapter->hw.mac_type == e1000_80003es2lan)) { + /* disable receives */ + uint32_t rctl = E1000_READ_REG(hw, RCTL); + E1000_WRITE_REG(hw, RCTL, rctl & ~E1000_RCTL_EN); + } + /* guard against interrupt when we're going down */ + if (!test_bit(__E1000_DOWN, &adapter->flags)) + mod_timer(&adapter->watchdog_timer, + jiffies + 1); + } + } else { + E1000_WRITE_REG(hw, ICR, (0xffffffff & ~(E1000_ICR_RXSEQ | + E1000_ICR_LSC))); + /* bummer we have to flush here, but things break otherwise as + * some event appears to be lost or delayed and throughput + * drops. In almost all tests this flush is un-necessary */ + E1000_WRITE_FLUSH(hw); +#ifdef CONFIG_E1000_NAPI + /* Interrupt Auto-Mask (IAM)...upon writing ICR, interrupts are + * masked. No need for the IMC write, but it does mean we + * should account for it ASAP. */ + atomic_inc(&adapter->irq_sem); +#endif + } + +#ifdef CONFIG_E1000_NAPI + if (likely(netif_rx_schedule_prep(netdev))) { + adapter->total_tx_bytes = 0; + adapter->total_tx_packets = 0; + adapter->total_rx_bytes = 0; + adapter->total_rx_packets = 0; + __netif_rx_schedule(netdev); + } else + e1000_irq_enable(adapter); +#else + adapter->total_tx_bytes = 0; + adapter->total_rx_bytes = 0; + adapter->total_tx_packets = 0; + adapter->total_rx_packets = 0; + + for (i = 0; i < E1000_MAX_INTR; i++) + if (unlikely(!adapter->clean_rx(adapter, adapter->rx_ring) & + !e1000_clean_tx_irq(adapter, adapter->tx_ring))) + break; + + if (likely(adapter->itr_setting & 3)) + e1000_set_itr(adapter); +#endif + + return IRQ_HANDLED; +} +#endif /** * e1000_intr - Interrupt Handler @@ -3458,7 +3695,17 @@ e1000_intr(int irq, void *data) uint32_t rctl, icr = E1000_READ_REG(hw, ICR); #ifndef CONFIG_E1000_NAPI int i; -#else +#endif + if (unlikely(!icr)) + return IRQ_NONE; /* Not our interrupt */ + +#ifdef CONFIG_E1000_NAPI + /* IMS will not auto-mask if INT_ASSERTED is not set, and if it is + * not set, then the adapter didn't send an interrupt */ + if (unlikely(hw->mac_type >= e1000_82571 && + !(icr & E1000_ICR_INT_ASSERTED))) + return IRQ_NONE; + /* Interrupt Auto-Mask...upon reading ICR, * interrupts are masked. No need for the * IMC write, but it does mean we should @@ -3467,14 +3714,6 @@ e1000_intr(int irq, void *data) atomic_inc(&adapter->irq_sem); #endif - if (unlikely(!icr)) { -#ifdef CONFIG_E1000_NAPI - if (hw->mac_type >= e1000_82571) - e1000_irq_enable(adapter); -#endif - return IRQ_NONE; /* Not our interrupt */ - } - if (unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) { hw->get_link_status = 1; /* 80003ES2LAN workaround-- @@ -3495,13 +3734,20 @@ e1000_intr(int irq, void *data) #ifdef CONFIG_E1000_NAPI if (unlikely(hw->mac_type < e1000_82571)) { + /* disable interrupts, without the synchronize_irq bit */ atomic_inc(&adapter->irq_sem); E1000_WRITE_REG(hw, IMC, ~0); E1000_WRITE_FLUSH(hw); } - if (likely(netif_rx_schedule_prep(netdev))) + if (likely(netif_rx_schedule_prep(netdev))) { + adapter->total_tx_bytes = 0; + adapter->total_tx_packets = 0; + adapter->total_rx_bytes = 0; + adapter->total_rx_packets = 0; __netif_rx_schedule(netdev); - else + } else + /* this really should not happen! if it does it is basically a + * bug, but not a hard error, so enable ints and continue */ e1000_irq_enable(adapter); #else /* Writing IMC and IMS is needed for 82547. @@ -3519,16 +3765,23 @@ e1000_intr(int irq, void *data) E1000_WRITE_REG(hw, IMC, ~0); } + adapter->total_tx_bytes = 0; + adapter->total_rx_bytes = 0; + adapter->total_tx_packets = 0; + adapter->total_rx_packets = 0; + for (i = 0; i < E1000_MAX_INTR; i++) if (unlikely(!adapter->clean_rx(adapter, adapter->rx_ring) & !e1000_clean_tx_irq(adapter, adapter->tx_ring))) break; + if (likely(adapter->itr_setting & 3)) + e1000_set_itr(adapter); + if (hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2) e1000_irq_enable(adapter); #endif - return IRQ_HANDLED; } @@ -3572,6 +3825,8 @@ e1000_clean(struct net_device *poll_dev, if ((!tx_cleaned && (work_done == 0)) || !netif_running(poll_dev)) { quit_polling: + if (likely(adapter->itr_setting & 3)) + e1000_set_itr(adapter); netif_rx_complete(poll_dev); e1000_irq_enable(adapter); return 0; @@ -3598,6 +3853,7 @@ e1000_clean_tx_irq(struct e1000_adapter unsigned int count = 0; #endif boolean_t cleaned = FALSE; + unsigned int total_tx_bytes=0, total_tx_packets=0; i = tx_ring->next_to_clean; eop = tx_ring->buffer_info[i].next_to_watch; @@ -3609,13 +3865,19 @@ e1000_clean_tx_irq(struct e1000_adapter buffer_info = &tx_ring->buffer_info[i]; cleaned = (i == eop); + if (cleaned) { + /* this packet count is wrong for TSO but has a + * tendency to make dynamic ITR change more + * towards bulk */ + total_tx_packets++; + total_tx_bytes += buffer_info->skb->len; + } e1000_unmap_and_free_tx_resource(adapter, buffer_info); - memset(tx_desc, 0, sizeof(struct e1000_tx_desc)); + tx_desc->upper.data = 0; if (unlikely(++i == tx_ring->count)) i = 0; } - eop = tx_ring->buffer_info[i].next_to_watch; eop_desc = E1000_TX_DESC(*tx_ring, eop); #ifdef CONFIG_E1000_NAPI @@ -3634,8 +3896,10 @@ e1000_clean_tx_irq(struct e1000_adapter * sees the new next_to_clean. */ smp_mb(); - if (netif_queue_stopped(netdev)) + if (netif_queue_stopped(netdev)) { netif_wake_queue(netdev); + ++adapter->restart_queue; + } } if (adapter->detect_tx_hung) { @@ -3673,6 +3937,8 @@ e1000_clean_tx_irq(struct e1000_adapter netif_stop_queue(netdev); } } + adapter->total_tx_bytes += total_tx_bytes; + adapter->total_tx_packets += total_tx_packets; return cleaned; } @@ -3752,6 +4018,7 @@ e1000_clean_rx_irq(struct e1000_adapter unsigned int i; int cleaned_count = 0; boolean_t cleaned = FALSE; + unsigned int total_rx_bytes=0, total_rx_packets=0; i = rx_ring->next_to_clean; rx_desc = E1000_RX_DESC(*rx_ring, i); @@ -3760,6 +4027,7 @@ e1000_clean_rx_irq(struct e1000_adapter while (rx_desc->status & E1000_RXD_STAT_DD) { struct sk_buff *skb; u8 status; + #ifdef CONFIG_E1000_NAPI if (*work_done >= work_to_do) break; @@ -3817,6 +4085,10 @@ e1000_clean_rx_irq(struct e1000_adapter * done after the TBI_ACCEPT workaround above */ length -= 4; + /* probably a little skewed due to removing CRC */ + total_rx_bytes += length; + total_rx_packets++; + /* code added for copybreak, this should improve * performance for small packets with large amounts * of reassembly being done in the stack */ @@ -3832,12 +4104,11 @@ e1000_clean_rx_irq(struct e1000_adapter /* save the skb in buffer_info as good */ buffer_info->skb = skb; skb = new_skb; - skb_put(skb, length); } - } else - skb_put(skb, length); - + /* else just continue with the old one */ + } /* end copybreak code */ + skb_put(skb, length); /* Receive Checksum Offload */ e1000_rx_checksum(adapter, @@ -3886,6 +4157,8 @@ next_desc: if (cleaned_count) adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count); + adapter->total_rx_packets += total_rx_packets; + adapter->total_rx_bytes += total_rx_bytes; return cleaned; } @@ -3915,6 +4188,7 @@ e1000_clean_rx_irq_ps(struct e1000_adapt uint32_t length, staterr; int cleaned_count = 0; boolean_t cleaned = FALSE; + unsigned int total_rx_bytes=0, total_rx_packets=0; i = rx_ring->next_to_clean; rx_desc = E1000_RX_DESC_PS(*rx_ring, i); @@ -3999,7 +4273,7 @@ e1000_clean_rx_irq_ps(struct e1000_adapt goto copydone; } /* if */ } - + for (j = 0; j < adapter->rx_ps_pages; j++) { if (!(length= le16_to_cpu(rx_desc->wb.upper.length[j]))) break; @@ -4019,6 +4293,9 @@ e1000_clean_rx_irq_ps(struct e1000_adapt pskb_trim(skb, skb->len - 4); copydone: + total_rx_bytes += skb->len; + total_rx_packets++; + e1000_rx_checksum(adapter, staterr, le16_to_cpu(rx_desc->wb.lower.hi_dword.csum_ip.csum), skb); skb->protocol = eth_type_trans(skb, netdev); @@ -4067,6 +4344,8 @@ next_desc: if (cleaned_count) adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count); + adapter->total_rx_packets += total_rx_packets; + adapter->total_rx_bytes += total_rx_bytes; return cleaned; } @@ -4234,7 +4513,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a } skb = netdev_alloc_skb(netdev, - adapter->rx_ps_bsize0 + NET_IP_ALIGN); + adapter->rx_ps_bsize0 + NET_IP_ALIGN); if (unlikely(!skb)) { adapter->alloc_rx_buff_failed++; @@ -4511,7 +4790,6 @@ e1000_read_pcie_cap_reg(struct e1000_hw return E1000_SUCCESS; } - void e1000_io_write(struct e1000_hw *hw, unsigned long port, uint32_t value) { @@ -4534,12 +4812,12 @@ e1000_vlan_rx_register(struct net_device E1000_WRITE_REG(&adapter->hw, CTRL, ctrl); if (adapter->hw.mac_type != e1000_ich8lan) { - /* enable VLAN receive filtering */ - rctl = E1000_READ_REG(&adapter->hw, RCTL); - rctl |= E1000_RCTL_VFE; - rctl &= ~E1000_RCTL_CFIEN; - E1000_WRITE_REG(&adapter->hw, RCTL, rctl); - e1000_update_mng_vlan(adapter); + /* enable VLAN receive filtering */ + rctl = E1000_READ_REG(&adapter->hw, RCTL); + rctl |= E1000_RCTL_VFE; + rctl &= ~E1000_RCTL_CFIEN; + E1000_WRITE_REG(&adapter->hw, RCTL, rctl); + e1000_update_mng_vlan(adapter); } } else { /* disable VLAN tag insert/strip */ @@ -4548,14 +4826,16 @@ e1000_vlan_rx_register(struct net_device E1000_WRITE_REG(&adapter->hw, CTRL, ctrl); if (adapter->hw.mac_type != e1000_ich8lan) { - /* disable VLAN filtering */ - rctl = E1000_READ_REG(&adapter->hw, RCTL); - rctl &= ~E1000_RCTL_VFE; - E1000_WRITE_REG(&adapter->hw, RCTL, rctl); - if (adapter->mng_vlan_id != (uint16_t)E1000_MNG_VLAN_NONE) { - e1000_vlan_rx_kill_vid(netdev, adapter->mng_vlan_id); - adapter->mng_vlan_id = E1000_MNG_VLAN_NONE; - } + /* disable VLAN filtering */ + rctl = E1000_READ_REG(&adapter->hw, RCTL); + rctl &= ~E1000_RCTL_VFE; + E1000_WRITE_REG(&adapter->hw, RCTL, rctl); + if (adapter->mng_vlan_id != + (uint16_t)E1000_MNG_VLAN_NONE) { + e1000_vlan_rx_kill_vid(netdev, + adapter->mng_vlan_id); + adapter->mng_vlan_id = E1000_MNG_VLAN_NONE; + } } } Index: linux/drivers/net/e1000/e1000_osdep.h =================================================================== --- linux.orig/drivers/net/e1000/e1000_osdep.h +++ linux/drivers/net/e1000/e1000_osdep.h @@ -107,17 +107,16 @@ typedef enum { #define E1000_WRITE_FLUSH(a) E1000_READ_REG(a, STATUS) -#define E1000_WRITE_ICH8_REG(a, reg, value) ( \ +#define E1000_WRITE_ICH_FLASH_REG(a, reg, value) ( \ writel((value), ((a)->flash_address + reg))) -#define E1000_READ_ICH8_REG(a, reg) ( \ +#define E1000_READ_ICH_FLASH_REG(a, reg) ( \ readl((a)->flash_address + reg)) -#define E1000_WRITE_ICH8_REG16(a, reg, value) ( \ +#define E1000_WRITE_ICH_FLASH_REG16(a, reg, value) ( \ writew((value), ((a)->flash_address + reg))) -#define E1000_READ_ICH8_REG16(a, reg) ( \ +#define E1000_READ_ICH_FLASH_REG16(a, reg) ( \ readw((a)->flash_address + reg)) - #endif /* _E1000_OSDEP_H_ */ Index: linux/drivers/net/e1000/e1000_param.c =================================================================== --- linux.orig/drivers/net/e1000/e1000_param.c +++ linux/drivers/net/e1000/e1000_param.c @@ -44,16 +44,6 @@ */ #define E1000_PARAM_INIT { [0 ... E1000_MAX_NIC] = OPTION_UNSET } -/* Module Parameters are always initialized to -1, so that the driver - * can tell the difference between no user specified value or the - * user asking for the default value. - * The true default values are loaded in when e1000_check_options is called. - * - * This is a GCC extension to ANSI C. - * See the item "Labeled Elements in Initializers" in the section - * "Extensions to the C Language Family" of the GCC documentation. - */ - #define E1000_PARAM(X, desc) \ static int __devinitdata X[E1000_MAX_NIC+1] = E1000_PARAM_INIT; \ static int num_##X = 0; \ @@ -67,7 +57,6 @@ * * Default Value: 256 */ - E1000_PARAM(TxDescriptors, "Number of transmit descriptors"); /* Receive Descriptor Count @@ -77,7 +66,6 @@ E1000_PARAM(TxDescriptors, "Number of tr * * Default Value: 256 */ - E1000_PARAM(RxDescriptors, "Number of receive descriptors"); /* User Specified Speed Override @@ -90,7 +78,6 @@ E1000_PARAM(RxDescriptors, "Number of re * * Default Value: 0 */ - E1000_PARAM(Speed, "Speed setting"); /* User Specified Duplex Override @@ -102,7 +89,6 @@ E1000_PARAM(Speed, "Speed setting"); * * Default Value: 0 */ - E1000_PARAM(Duplex, "Duplex setting"); /* Auto-negotiation Advertisement Override @@ -119,8 +105,9 @@ E1000_PARAM(Duplex, "Duplex setting"); * * Default Value: 0x2F (copper); 0x20 (fiber) */ - E1000_PARAM(AutoNeg, "Advertised auto-negotiation setting"); +#define AUTONEG_ADV_DEFAULT 0x2F +#define AUTONEG_ADV_MASK 0x2F /* User Specified Flow Control Override * @@ -132,8 +119,8 @@ E1000_PARAM(AutoNeg, "Advertised auto-ne * * Default Value: Read flow control settings from the EEPROM */ - E1000_PARAM(FlowControl, "Flow Control setting"); +#define FLOW_CONTROL_DEFAULT FLOW_CONTROL_FULL /* XsumRX - Receive Checksum Offload Enable/Disable * @@ -144,53 +131,54 @@ E1000_PARAM(FlowControl, "Flow Control s * * Default Value: 1 */ - E1000_PARAM(XsumRX, "Disable or enable Receive Checksum offload"); /* Transmit Interrupt Delay in units of 1.024 microseconds + * Tx interrupt delay needs to typically be set to something non zero * * Valid Range: 0-65535 - * - * Default Value: 64 */ - E1000_PARAM(TxIntDelay, "Transmit Interrupt Delay"); +#define DEFAULT_TIDV 8 +#define MAX_TXDELAY 0xFFFF +#define MIN_TXDELAY 0 /* Transmit Absolute Interrupt Delay in units of 1.024 microseconds * * Valid Range: 0-65535 - * - * Default Value: 0 */ - E1000_PARAM(TxAbsIntDelay, "Transmit Absolute Interrupt Delay"); +#define DEFAULT_TADV 32 +#define MAX_TXABSDELAY 0xFFFF +#define MIN_TXABSDELAY 0 /* Receive Interrupt Delay in units of 1.024 microseconds + * hardware will likely hang if you set this to anything but zero. * * Valid Range: 0-65535 - * - * Default Value: 0 */ - E1000_PARAM(RxIntDelay, "Receive Interrupt Delay"); +#define DEFAULT_RDTR 0 +#define MAX_RXDELAY 0xFFFF +#define MIN_RXDELAY 0 /* Receive Absolute Interrupt Delay in units of 1.024 microseconds * * Valid Range: 0-65535 - * - * Default Value: 128 */ - E1000_PARAM(RxAbsIntDelay, "Receive Absolute Interrupt Delay"); +#define DEFAULT_RADV 8 +#define MAX_RXABSDELAY 0xFFFF +#define MIN_RXABSDELAY 0 /* Interrupt Throttle Rate (interrupts/sec) * - * Valid Range: 100-100000 (0=off, 1=dynamic) - * - * Default Value: 8000 + * Valid Range: 100-100000 (0=off, 1=dynamic, 3=dynamic conservative) */ - E1000_PARAM(InterruptThrottleRate, "Interrupt Throttling Rate"); +#define DEFAULT_ITR 3 +#define MAX_ITR 100000 +#define MIN_ITR 100 /* Enable Smart Power Down of the PHY * @@ -198,7 +186,6 @@ E1000_PARAM(InterruptThrottleRate, "Inte * * Default Value: 0 (disabled) */ - E1000_PARAM(SmartPowerDownEnable, "Enable PHY smart power down"); /* Enable Kumeran Lock Loss workaround @@ -207,33 +194,8 @@ E1000_PARAM(SmartPowerDownEnable, "Enabl * * Default Value: 1 (enabled) */ - E1000_PARAM(KumeranLockLoss, "Enable Kumeran lock loss workaround"); -#define AUTONEG_ADV_DEFAULT 0x2F -#define AUTONEG_ADV_MASK 0x2F -#define FLOW_CONTROL_DEFAULT FLOW_CONTROL_FULL - -#define DEFAULT_RDTR 0 -#define MAX_RXDELAY 0xFFFF -#define MIN_RXDELAY 0 - -#define DEFAULT_RADV 128 -#define MAX_RXABSDELAY 0xFFFF -#define MIN_RXABSDELAY 0 - -#define DEFAULT_TIDV 64 -#define MAX_TXDELAY 0xFFFF -#define MIN_TXDELAY 0 - -#define DEFAULT_TADV 64 -#define MAX_TXABSDELAY 0xFFFF -#define MIN_TXABSDELAY 0 - -#define DEFAULT_ITR 8000 -#define MAX_ITR 100000 -#define MIN_ITR 100 - struct e1000_option { enum { enable_option, range_option, list_option } type; char *name; @@ -510,15 +472,27 @@ e1000_check_options(struct e1000_adapter break; case 1: DPRINTK(PROBE, INFO, "%s set to dynamic mode\n", - opt.name); + opt.name); + adapter->itr_setting = adapter->itr; + adapter->itr = 20000; + break; + case 3: + DPRINTK(PROBE, INFO, + "%s set to dynamic conservative mode\n", + opt.name); + adapter->itr_setting = adapter->itr; + adapter->itr = 20000; break; default: e1000_validate_option(&adapter->itr, &opt, - adapter); + adapter); + /* save the setting, because the dynamic bits change itr */ + adapter->itr_setting = adapter->itr; break; } } else { - adapter->itr = opt.def; + adapter->itr_setting = opt.def; + adapter->itr = 20000; } } { /* Smart Power Down */ Index: linux/drivers/net/hamradio/6pack.c =================================================================== --- linux.orig/drivers/net/hamradio/6pack.c +++ linux/drivers/net/hamradio/6pack.c @@ -123,7 +123,7 @@ struct sixpack { struct timer_list tx_t; struct timer_list resync_t; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; spinlock_t lock; }; Index: linux/drivers/net/hamradio/mkiss.c =================================================================== --- linux.orig/drivers/net/hamradio/mkiss.c +++ linux/drivers/net/hamradio/mkiss.c @@ -84,7 +84,7 @@ struct mkiss { #define CRC_MODE_SMACK_TEST 4 atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; }; /*---------------------------------------------------------------------------*/ Index: linux/drivers/net/ibm_emac/ibm_emac_core.c =================================================================== --- linux.orig/drivers/net/ibm_emac/ibm_emac_core.c +++ linux/drivers/net/ibm_emac/ibm_emac_core.c @@ -1061,6 +1061,8 @@ static inline int emac_xmit_finish(struc ++dev->stats.tx_packets; dev->stats.tx_bytes += len; + spin_unlock(&dev->tx_lock); + return 0; } @@ -1074,6 +1076,7 @@ static int emac_start_xmit(struct sk_buf u16 ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY | MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb); + spin_lock(&dev->tx_lock); slot = dev->tx_slot++; if (dev->tx_slot == NUM_TX_BUFF) { dev->tx_slot = 0; @@ -1243,6 +1246,7 @@ static void emac_poll_tx(void *param) DBG2("%d: poll_tx, %d %d" NL, dev->def->index, dev->tx_cnt, dev->ack_slot); + spin_lock(&dev->tx_lock); if (dev->tx_cnt) { u16 ctrl; int slot = dev->ack_slot, n = 0; @@ -1252,6 +1256,7 @@ static void emac_poll_tx(void *param) struct sk_buff *skb = dev->tx_skb[slot]; ++n; + spin_unlock(&dev->tx_lock); if (skb) { dev_kfree_skb(skb); dev->tx_skb[slot] = NULL; @@ -1261,6 +1266,7 @@ static void emac_poll_tx(void *param) if (unlikely(EMAC_IS_BAD_TX(ctrl))) emac_parse_tx_error(dev, ctrl); + spin_lock(&dev->tx_lock); if (--dev->tx_cnt) goto again; } @@ -1273,6 +1279,7 @@ static void emac_poll_tx(void *param) DBG2("%d: tx %d pkts" NL, dev->def->index, n); } } + spin_unlock(&dev->tx_lock); } static inline void emac_recycle_rx_skb(struct ocp_enet_private *dev, int slot, @@ -1966,6 +1973,7 @@ static int __init emac_probe(struct ocp_ dev->ldev = &ocpdev->dev; dev->def = ocpdev->def; SET_MODULE_OWNER(ndev); + spin_lock_init(&dev->tx_lock); /* Find MAL device we are connected to */ maldev = Index: linux/drivers/net/ibm_emac/ibm_emac_core.h =================================================================== --- linux.orig/drivers/net/ibm_emac/ibm_emac_core.h +++ linux/drivers/net/ibm_emac/ibm_emac_core.h @@ -193,6 +193,8 @@ struct ocp_enet_private { struct ibm_emac_error_stats estats; struct net_device_stats nstats; + spinlock_t tx_lock; + struct device* ldev; }; Index: linux/drivers/net/loopback.c =================================================================== --- linux.orig/drivers/net/loopback.c +++ linux/drivers/net/loopback.c @@ -153,14 +153,14 @@ static int loopback_xmit(struct sk_buff #endif dev->last_rx = jiffies; - /* it's OK to use __get_cpu_var() because BHs are off */ - lb_stats = &__get_cpu_var(pcpu_lstats); + lb_stats = &per_cpu(pcpu_lstats, get_cpu()); lb_stats->bytes += skb->len; lb_stats->packets++; + put_cpu(); netif_rx(skb); - return 0; + return(0); } static struct net_device_stats loopback_stats; Index: linux/drivers/net/netconsole.c =================================================================== --- linux.orig/drivers/net/netconsole.c +++ linux/drivers/net/netconsole.c @@ -74,11 +74,18 @@ static void write_msg(struct console *co if (!np.dev) return; - local_irq_save(flags); + /* + * A bit hairy. Netconsole uses mutexes (indirectly) and + * thus must have interrupts enabled: + */ + local_save_flags(flags); + local_irq_enable_rt(); for(left = len; left; ) { frag = min(left, MAX_PRINT_CHUNK); + WARN_ON_RT(irqs_disabled()); netpoll_send_udp(&np, msg, frag); + WARN_ON_RT(irqs_disabled()); msg += frag; left -= frag; } Index: linux/drivers/net/plip.c =================================================================== --- linux.orig/drivers/net/plip.c +++ linux/drivers/net/plip.c @@ -227,7 +227,10 @@ struct net_local { struct hh_cache *hh); spinlock_t lock; atomic_t kill_timer; - struct semaphore killed_timer_sem; + /* + * PREEMPT_RT: this isnt a mutex, it should be struct completion. + */ + struct compat_semaphore killed_timer_sem; }; static inline void enable_parport_interrupts (struct net_device *dev) Index: linux/drivers/net/ppp_async.c =================================================================== --- linux.orig/drivers/net/ppp_async.c +++ linux/drivers/net/ppp_async.c @@ -67,7 +67,7 @@ struct asyncppp { struct tasklet_struct tsk; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; struct ppp_channel chan; /* interface to generic ppp layer */ unsigned char obuf[OBUFSIZE]; }; Index: linux/drivers/net/ppp_synctty.c =================================================================== --- linux.orig/drivers/net/ppp_synctty.c +++ linux/drivers/net/ppp_synctty.c @@ -70,7 +70,7 @@ struct syncppp { struct tasklet_struct tsk; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; struct ppp_channel chan; /* interface to generic ppp layer */ }; Index: linux/drivers/net/sungem.c =================================================================== --- linux.orig/drivers/net/sungem.c +++ linux/drivers/net/sungem.c @@ -1037,10 +1037,8 @@ static int gem_start_xmit(struct sk_buff (csum_stuff_off << 21)); } - local_irq_save(flags); - if (!spin_trylock(&gp->tx_lock)) { + if (!spin_trylock_irqsave(&gp->tx_lock, flags)) { /* Tell upper layer to requeue */ - local_irq_restore(flags); return NETDEV_TX_LOCKED; } /* We raced with gem_do_stop() */ Index: linux/drivers/net/tulip/tulip_core.c =================================================================== --- linux.orig/drivers/net/tulip/tulip_core.c +++ linux/drivers/net/tulip/tulip_core.c @@ -1806,6 +1806,7 @@ static void __devexit tulip_remove_one ( pci_iounmap(pdev, tp->base_addr); free_netdev (dev); pci_release_regions (pdev); + pci_disable_device (pdev); pci_set_drvdata (pdev, NULL); /* pci_power_off (pdev, -1); */ Index: linux/drivers/oprofile/oprofilefs.c =================================================================== --- linux.orig/drivers/oprofile/oprofilefs.c +++ linux/drivers/oprofile/oprofilefs.c @@ -21,7 +21,7 @@ #define OPROFILEFS_MAGIC 0x6f70726f -DEFINE_SPINLOCK(oprofilefs_lock); +DEFINE_RAW_SPINLOCK(oprofilefs_lock); static struct inode * oprofilefs_get_inode(struct super_block * sb, int mode) { Index: linux/drivers/pci/access.c =================================================================== --- linux.orig/drivers/pci/access.c +++ linux/drivers/pci/access.c @@ -9,7 +9,7 @@ * configuration space. */ -static DEFINE_SPINLOCK(pci_lock); +static DEFINE_RAW_SPINLOCK(pci_lock); /* * Wrappers for all PCI configuration access functions. They just check Index: linux/drivers/pci/hotplug/cpci_hotplug_core.c =================================================================== --- linux.orig/drivers/pci/hotplug/cpci_hotplug_core.c +++ linux/drivers/pci/hotplug/cpci_hotplug_core.c @@ -59,8 +59,8 @@ static int slots; static atomic_t extracting; int cpci_debug; static struct cpci_hp_controller *controller; -static struct semaphore event_semaphore; /* mutex for process loop (up if something to process) */ -static struct semaphore thread_exit; /* guard ensure thread has exited before calling it quits */ +static struct compat_semaphore event_semaphore; /* mutex for process loop (up if something to process) */ +static struct compat_semaphore thread_exit; /* guard ensure thread has exited before calling it quits */ static int thread_finished = 1; static int enable_slot(struct hotplug_slot *slot); Index: linux/drivers/pci/hotplug/cpqphp_ctrl.c =================================================================== --- linux.orig/drivers/pci/hotplug/cpqphp_ctrl.c +++ linux/drivers/pci/hotplug/cpqphp_ctrl.c @@ -45,8 +45,8 @@ static int configure_new_function(struct u8 behind_bridge, struct resource_lists *resources); static void interrupt_event_handler(struct controller *ctrl); -static struct semaphore event_semaphore; /* mutex for process loop (up if something to process) */ -static struct semaphore event_exit; /* guard ensure thread has exited before calling it quits */ +static struct compat_semaphore event_semaphore; /* mutex for process loop (up if something to process) */ +static struct compat_semaphore event_exit; /* guard ensure thread has exited before calling it quits */ static int event_finished; static unsigned long pushbutton_pending; /* = 0 */ Index: linux/drivers/pci/hotplug/ibmphp_hpc.c =================================================================== --- linux.orig/drivers/pci/hotplug/ibmphp_hpc.c +++ linux/drivers/pci/hotplug/ibmphp_hpc.c @@ -106,7 +106,7 @@ static int tid_poll; static struct mutex sem_hpcaccess; // lock access to HPC static struct semaphore semOperations; // lock all operations and // access to data structures -static struct semaphore sem_exit; // make sure polling thread goes away +static struct compat_semaphore sem_exit; // make sure polling thread goes away //---------------------------------------------------------------------------- // local function prototypes //---------------------------------------------------------------------------- Index: linux/drivers/pci/hotplug/pciehp_ctrl.c =================================================================== --- linux.orig/drivers/pci/hotplug/pciehp_ctrl.c +++ linux/drivers/pci/hotplug/pciehp_ctrl.c @@ -37,8 +37,8 @@ static void interrupt_event_handler(struct controller *ctrl); -static struct semaphore event_semaphore; /* mutex for process loop (up if something to process) */ -static struct semaphore event_exit; /* guard ensure thread has exited before calling it quits */ +static struct compat_semaphore event_semaphore; /* mutex for process loop (up if something to process) */ +static struct compat_semaphore event_exit; /* guard ensure thread has exited before calling it quits */ static int event_finished; static unsigned long pushbutton_pending; /* = 0 */ static unsigned long surprise_rm_pending; /* = 0 */ Index: linux/drivers/pci/msi.c =================================================================== --- linux.orig/drivers/pci/msi.c +++ linux/drivers/pci/msi.c @@ -24,7 +24,7 @@ #include "pci.h" #include "msi.h" -static DEFINE_SPINLOCK(msi_lock); +static DEFINE_RAW_SPINLOCK(msi_lock); static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL }; static kmem_cache_t* msi_cachep; Index: linux/drivers/pci/pcie/aer/aerdrv.c =================================================================== --- linux.orig/drivers/pci/pcie/aer/aerdrv.c +++ linux/drivers/pci/pcie/aer/aerdrv.c @@ -157,7 +157,7 @@ static struct aer_rpc* aer_alloc_rpc(str * Initialize Root lock access, e_lock, to Root Error Status Reg, * Root Error ID Reg, and Root error producer/consumer index. */ - rpc->e_lock = SPIN_LOCK_UNLOCKED; + spin_lock_init(&rpc->e_lock); rpc->rpd = dev; INIT_WORK(&rpc->dpc_handler, aer_isr, (void *)dev); Index: linux/drivers/scsi/aacraid/aacraid.h =================================================================== --- linux.orig/drivers/scsi/aacraid/aacraid.h +++ linux/drivers/scsi/aacraid/aacraid.h @@ -737,7 +737,7 @@ struct aac_fib_context { u32 unique; // unique value representing this context ulong jiffies; // used for cleanup - dmb changed to ulong struct list_head next; // used to link context's into a linked list - struct semaphore wait_sem; // this is used to wait for the next fib to arrive. + struct compat_semaphore wait_sem; // this is used to wait for the next fib to arrive. int wait; // Set to true when thread is in WaitForSingleObject unsigned long count; // total number of FIBs on FibList struct list_head fib_list; // this holds fibs and their attachd hw_fibs @@ -807,7 +807,7 @@ struct fib { * This is the event the sendfib routine will wait on if the * caller did not pass one and this is synch io. */ - struct semaphore event_wait; + struct compat_semaphore event_wait; spinlock_t event_lock; u32 done; /* gets set to 1 when fib is complete */ Index: linux/drivers/scsi/qla2xxx/qla_def.h =================================================================== --- linux.orig/drivers/scsi/qla2xxx/qla_def.h +++ linux/drivers/scsi/qla2xxx/qla_def.h @@ -2317,7 +2317,7 @@ typedef struct scsi_qla_host { spinlock_t mbx_reg_lock; /* Mbx Cmd Register Lock */ struct semaphore mbx_cmd_sem; /* Serialialize mbx access */ - struct semaphore mbx_intr_sem; /* Used for completion notification */ + struct compat_semaphore mbx_intr_sem; /* Used for completion notification */ uint32_t mbx_flags; #define MBX_IN_PROGRESS BIT_0 Index: linux/drivers/scsi/scsi.c =================================================================== --- linux.orig/drivers/scsi/scsi.c +++ linux/drivers/scsi/scsi.c @@ -871,9 +871,9 @@ EXPORT_SYMBOL(scsi_device_get); */ void scsi_device_put(struct scsi_device *sdev) { +#ifdef CONFIG_MODULE_UNLOAD struct module *module = sdev->host->hostt->module; -#ifdef CONFIG_MODULE_UNLOAD /* The module refcount will be zero if scsi_device_get() * was called from a module removal routine */ if (module && module_refcount(module) != 0) Index: linux/drivers/serial/8250.c =================================================================== --- linux.orig/drivers/serial/8250.c +++ linux/drivers/serial/8250.c @@ -2263,14 +2263,10 @@ serial8250_console_write(struct console touch_nmi_watchdog(); - local_irq_save(flags); - if (up->port.sysrq) { - /* serial8250_handle_port() already took the lock */ - locked = 0; - } else if (oops_in_progress) { - locked = spin_trylock(&up->port.lock); - } else - spin_lock(&up->port.lock); + if (up->port.sysrq || oops_in_progress) + locked = spin_trylock_irqsave(&up->port.lock, flags); + else + spin_lock_irqsave(&up->port.lock, flags); /* * First save the IER then disable the interrupts @@ -2292,8 +2288,7 @@ serial8250_console_write(struct console serial_out(up, UART_IER, ier); if (locked) - spin_unlock(&up->port.lock); - local_irq_restore(flags); + spin_unlock_irqrestore(&up->port.lock, flags); } static int serial8250_console_setup(struct console *co, char *options) Index: linux/drivers/usb/core/devio.c =================================================================== --- linux.orig/drivers/usb/core/devio.c +++ linux/drivers/usb/core/devio.c @@ -309,10 +309,11 @@ static void async_completed(struct urb * struct async *as = urb->context; struct dev_state *ps = as->ps; struct siginfo sinfo; + unsigned long flags; - spin_lock(&ps->lock); - list_move_tail(&as->asynclist, &ps->async_completed); - spin_unlock(&ps->lock); + spin_lock_irqsave(&ps->lock, flags); + list_move_tail(&as->asynclist, &ps->async_completed); + spin_unlock_irqrestore(&ps->lock, flags); if (as->signr) { sinfo.si_signo = as->signr; sinfo.si_errno = as->urb->status; Index: linux/drivers/usb/core/hcd.c =================================================================== --- linux.orig/drivers/usb/core/hcd.c +++ linux/drivers/usb/core/hcd.c @@ -517,13 +517,11 @@ error: } /* any errors get returned through the urb completion */ - local_irq_save (flags); - spin_lock (&urb->lock); + spin_lock_irqsave(&urb->lock, flags); if (urb->status == -EINPROGRESS) urb->status = status; - spin_unlock (&urb->lock); + spin_unlock_irqrestore(&urb->lock, flags); usb_hcd_giveback_urb (hcd, urb); - local_irq_restore (flags); return 0; } @@ -551,8 +549,7 @@ void usb_hcd_poll_rh_status(struct usb_h if (length > 0) { /* try to complete the status urb */ - local_irq_save (flags); - spin_lock(&hcd_root_hub_lock); + spin_lock_irqsave(&hcd_root_hub_lock, flags); urb = hcd->status_urb; if (urb) { spin_lock(&urb->lock); @@ -568,14 +565,13 @@ void usb_hcd_poll_rh_status(struct usb_h spin_unlock(&urb->lock); } else length = 0; - spin_unlock(&hcd_root_hub_lock); + spin_unlock_irqrestore(&hcd_root_hub_lock, flags); /* local irqs are always blocked in completions */ if (length > 0) usb_hcd_giveback_urb (hcd, urb); else hcd->poll_pending = 1; - local_irq_restore (flags); } /* The USB 2.0 spec says 256 ms. This is close enough and won't @@ -647,17 +643,15 @@ static int usb_rh_urb_dequeue (struct us } else { /* Status URB */ if (!hcd->uses_new_polling) del_timer (&hcd->rh_timer); - local_irq_save (flags); - spin_lock (&hcd_root_hub_lock); + spin_lock_irqsave(&hcd_root_hub_lock, flags); if (urb == hcd->status_urb) { hcd->status_urb = NULL; urb->hcpriv = NULL; } else urb = NULL; /* wasn't fully queued */ - spin_unlock (&hcd_root_hub_lock); + spin_unlock_irqrestore(&hcd_root_hub_lock, flags); if (urb) usb_hcd_giveback_urb (hcd, urb); - local_irq_restore (flags); } return 0; @@ -1311,11 +1305,9 @@ void usb_hcd_endpoint_disable (struct us WARN_ON (!HC_IS_RUNNING (hcd->state) && hcd->state != HC_STATE_HALT && udev->state != USB_STATE_NOTATTACHED); - local_irq_disable (); - /* ep is already gone from udev->ep_{in,out}[]; no more submits */ rescan: - spin_lock (&hcd_data_lock); + spin_lock_irq(&hcd_data_lock); list_for_each_entry (urb, &ep->urb_list, urb_list) { int tmp; @@ -1323,13 +1315,13 @@ rescan: if (urb->status != -EINPROGRESS) continue; usb_get_urb (urb); - spin_unlock (&hcd_data_lock); + spin_unlock_irq(&hcd_data_lock); - spin_lock (&urb->lock); + spin_lock_irq(&urb->lock); tmp = urb->status; if (tmp == -EINPROGRESS) urb->status = -ESHUTDOWN; - spin_unlock (&urb->lock); + spin_unlock_irq(&urb->lock); /* kick hcd unless it's already returning this */ if (tmp == -EINPROGRESS) { @@ -1352,8 +1344,7 @@ rescan: /* list contents may have changed */ goto rescan; } - spin_unlock (&hcd_data_lock); - local_irq_enable (); + spin_unlock_irq(&hcd_data_lock); /* synchronize with the hardware, so old configuration state * clears out immediately (and will be freed). Index: linux/drivers/usb/core/message.c =================================================================== --- linux.orig/drivers/usb/core/message.c +++ linux/drivers/usb/core/message.c @@ -249,8 +249,9 @@ static void sg_clean (struct usb_sg_requ static void sg_complete (struct urb *urb) { struct usb_sg_request *io = urb->context; + unsigned long flags; - spin_lock (&io->lock); + spin_lock_irqsave (&io->lock, flags); /* In 2.5 we require hcds' endpoint queues not to progress after fault * reports, until the completion callback (this!) returns. That lets @@ -284,7 +285,7 @@ static void sg_complete (struct urb *urb * unlink pending urbs so they won't rx/tx bad data. * careful: unlink can sometimes be synchronous... */ - spin_unlock (&io->lock); + spin_unlock_irqrestore (&io->lock, flags); for (i = 0, found = 0; i < io->entries; i++) { if (!io->urbs [i] || !io->urbs [i]->dev) continue; @@ -299,7 +300,7 @@ static void sg_complete (struct urb *urb } else if (urb == io->urbs [i]) found = 1; } - spin_lock (&io->lock); + spin_lock_irqsave (&io->lock, flags); } urb->dev = NULL; @@ -309,7 +310,7 @@ static void sg_complete (struct urb *urb if (!io->count) complete (&io->complete); - spin_unlock (&io->lock); + spin_unlock_irqrestore (&io->lock, flags); } @@ -571,7 +572,7 @@ void usb_sg_cancel (struct usb_sg_reques dev_warn (&io->dev->dev, "%s, unlink --> %d\n", __FUNCTION__, retval); } - spin_lock (&io->lock); + spin_lock_irqsave (&io->lock, flags); } spin_unlock_irqrestore (&io->lock, flags); } Index: linux/drivers/usb/net/usbnet.c =================================================================== --- linux.orig/drivers/usb/net/usbnet.c +++ linux/drivers/usb/net/usbnet.c @@ -898,6 +898,8 @@ static void tx_complete (struct urb *urb urb->dev = NULL; entry->state = tx_done; + spin_lock_rt(&dev->txq.lock); + spin_unlock_rt(&dev->txq.lock); defer_bh(dev, skb, &dev->txq); } Index: linux/drivers/usb/storage/usb.h =================================================================== --- linux.orig/drivers/usb/storage/usb.h +++ linux/drivers/usb/storage/usb.h @@ -147,7 +147,7 @@ struct us_data { dma_addr_t iobuf_dma; /* mutual exclusion and synchronization structures */ - struct semaphore sema; /* to sleep thread on */ + struct compat_semaphore sema; /* to sleep thread on */ struct completion notify; /* thread begin/end */ wait_queue_head_t delay_wait; /* wait during scan, reset */ Index: linux/drivers/video/console/fbcon.c =================================================================== --- linux.orig/drivers/video/console/fbcon.c +++ linux/drivers/video/console/fbcon.c @@ -1247,7 +1247,6 @@ static void fbcon_clear(struct vc_data * { struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]]; struct fbcon_ops *ops = info->fbcon_par; - struct display *p = &fb_display[vc->vc_num]; u_int y_break; @@ -1276,10 +1275,11 @@ static void fbcon_putcs(struct vc_data * struct display *p = &fb_display[vc->vc_num]; struct fbcon_ops *ops = info->fbcon_par; - if (!fbcon_is_inactive(vc, info)) + if (!fbcon_is_inactive(vc, info)) { ops->putcs(vc, info, s, count, real_y(p, ypos), xpos, get_color(vc, info, scr_readw(s), 1), get_color(vc, info, scr_readw(s), 0)); + } } static void fbcon_putc(struct vc_data *vc, int c, int ypos, int xpos) @@ -3085,6 +3085,7 @@ static const struct consw fb_con = { .con_screen_pos = fbcon_screen_pos, .con_getxy = fbcon_getxy, .con_resize = fbcon_resize, + .con_preemptible = 1, }; static struct notifier_block fbcon_event_notifier = { Index: linux/drivers/video/console/vgacon.c =================================================================== --- linux.orig/drivers/video/console/vgacon.c +++ linux/drivers/video/console/vgacon.c @@ -52,7 +52,7 @@ #include