Help evaluating a system crash resistance method

Discussion in 'Windows Vista Drivers' started by darkside, Jan 16, 2006.

  1. darkside

    darkside Guest


    I just thought of an approach to hold off system crashes caused by small
    software problems. Please help evaluate it:

    1. Hook the IDT: replace exception vectors such as memory access violation
    and divide by zero with our own handling logic.

    2. In our handler, if the IRQL is at or above DISPATCH_LEVEL, transfer
    control to the original OS handler, which will eventually show the blue
    screen. Otherwise, call KeDelayExecutionThread() to hold the problem thread
    for a while so that the kernel reschedules to other threads. In that case
    the system stays alive instead of going to a blue screen.
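
A minimal user-space sketch of the decision in step 2, for illustration only. The constants mirror the x86 IRQL numbering (PASSIVE_LEVEL 0, APC_LEVEL 1, DISPATCH_LEVEL 2), and vectors 0 and 14 are the x86 divide error and page fault. `choose_fault_action`, `fault_action`, and the `_SIM` constants are hypothetical names, not Windows APIs. The sketch treats DISPATCH_LEVEL itself as too high to wait, since KeDelayExecutionThread() is documented for IRQL <= APC_LEVEL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the x86 IRQL values; not the real WDK names. */
enum { PASSIVE_LEVEL_SIM = 0, APC_LEVEL_SIM = 1, DISPATCH_LEVEL_SIM = 2 };

typedef enum { ACTION_CHAIN_TO_OS, ACTION_PARK_THREAD } fault_action;

/* Vectors the proposal would hook: 0 = divide error, 14 = page fault. */
bool is_hooked_vector(unsigned vector) {
    return vector == 0 || vector == 14;
}

/* Decide what the replacement handler would do for a given fault. */
fault_action choose_fault_action(unsigned vector, unsigned irql) {
    assert(vector < 256);                  /* the IDT has 256 entries */
    if (!is_hooked_vector(vector))
        return ACTION_CHAIN_TO_OS;         /* not ours: let the OS handle it */
    if (irql >= DISPATCH_LEVEL_SIM)
        return ACTION_CHAIN_TO_OS;         /* cannot block at this level */
    return ACTION_PARK_THREAD;             /* low IRQL: delay the thread */
}
```

Note that the sketch only captures the branching. It says nothing about whether parking the thread is safe, which is the question the rest of the thread discusses.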

    I've spent several years fixing kernel bugs, and it is a pity to see the
    system die so often because of a tiny driver problem. I would like to
    develop a kernel component that can help with this. It is NOT meant to
    handle every crash: I'm aware that many crashes are so severe that holding
    the thread for a while is no use. It targets software faults such as
    divide by zero and memory access violations, each of which has a separate
    IDT entry that can be selectively replaced.

    My questions are:
    1. Is there any formal way to get the IDT address and selectively replace
    some IDT entries?
    2. How long can KeDelayExecutionThread() hold the problem thread in
    practice?
    3. Will the overall mechanism work when driver code raises a kernel crash?
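
For context on question 1, here is a hedged sketch of what "replacing an IDT item" means at the data level. It models the 8-byte interrupt gate layout used in 32-bit protected mode and swaps the handler address in an in-memory copy only. Nothing here reads or writes a real IDT, and `idt_gate32`, `gate_get_handler`, and `gate_set_handler` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit protected-mode interrupt gate descriptor (8 bytes). */
typedef struct {
    uint16_t offset_low;   /* handler address, bits 0..15  */
    uint16_t selector;     /* code segment selector        */
    uint8_t  reserved;     /* must be zero                 */
    uint8_t  type_attr;    /* present bit, DPL, gate type  */
    uint16_t offset_high;  /* handler address, bits 16..31 */
} idt_gate32;

/* Reassemble the handler address from its two halves. */
uint32_t gate_get_handler(const idt_gate32 *g) {
    assert(g != NULL);
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Point the gate at a new handler and return the old address,
   which a hook would keep so it can chain to the original. */
uint32_t gate_set_handler(idt_gate32 *g, uint32_t new_handler) {
    uint32_t old = gate_get_handler(g);
    g->offset_low  = (uint16_t)(new_handler & 0xFFFFu);
    g->offset_high = (uint16_t)(new_handler >> 16);
    return old;
}
```

On real hardware the two halves cannot be updated in one step with this code, so another processor could observe a half-written entry. That is one more reason the sketch is only a model.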

    Thank you!

    darkside, Jan 16, 2006

  2. Don Burn

    Don Burn Guest

    As someone who has worked in the fault-tolerant part of the computer
    industry, I can tell you the problem is a lot harder than you imagine. I
    know of a number of companies working on potential solutions (I am a
    founder of one of them), but you are not going to see discussions of their
    technology here; you won't get that without an NDA.

    What I can say is you either have to wrap a layer of protection around the
    whole driver (such as moving it into its own address space) or provide a way
    to capture enough system state to go back before the crash and do something
    to avert it. Neither of these is a small task like tweaking an IDT
    entry, and neither can easily be explained in a newsgroup.
    Don Burn, Jan 16, 2006

  3. darkside

    darkside Guest

    I don't understand what you mean by "such as moving it into its own
    address space" (moving what into whose address space?). Can you explain?

    Regarding the system state, I think the processor should preserve most of
    it, if not all. This is the standard kernel exception handling mechanism:
    the processor and the OS know what they need to keep track of...
    darkside, Jan 16, 2006
  4. Don Burn

    Don Burn Guest

    There are two basic mechanisms for fault tolerance: masking and
    checkpointing. In this case you can mask the fault by making sure that the
    system is not impacted by the bad driver. But the only way of doing this
    masking is to make sure the driver cannot modify any system memory, so you
    put it in its own address space (i.e. run it like a process); then if you
    have to throw away the bad driver, you are safe.
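
The masking idea can be illustrated in user space, as a rough analogy only. The sketch below assumes a POSIX system and runs a stand-in "driver" routine in a child process, so a fault kills the child and leaves the parent untouched. `run_isolated`, `buggy_driver_entry`, and `well_behaved_entry` are hypothetical names, and the fault is simulated with `raise(SIGSEGV)` rather than a real bad memory access.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a buggy driver routine: simulates a memory fault. */
void buggy_driver_entry(void) {
    raise(SIGSEGV);
}

/* Stand-in for a driver routine that completes normally. */
void well_behaved_entry(void) {
}

/* Run `entry` in its own process (its own address space).
   Returns 0 on a clean exit, the signal number if the child was
   killed by a signal, or -1 on any other error. */
int run_isolated(void (*entry)(void)) {
    assert(entry != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);  /* make sure the fault terminates the child */
        entry();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);   /* the "driver" crashed; the parent did not */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

The parent can discard the failed child and carry on because the child never had write access to the parent's memory. A kernel-mode driver has no such boundary by default.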

    Checkpointing means you retain the system state so when bad things happen
    you can restore it. The problem here is you need to know everything the
    driver did and all other things that depended on what the driver did (for
    instance, setting a dispatcher object in the kernel may cause other drivers
    to believe that certain things are correct, but the misbehaving driver was
    wrong), and restore them to a state prior to the problem, so you can do
    something to fix it.
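
A toy sketch of why checkpointing is hard, under the assumption that only the driver's own state is recorded. All names (`driver_state`, `other_driver`, `faulty_operation`, `run_with_checkpoint`) are hypothetical. The rollback restores the driver's state, but the side effect on a second component that reacted to the wrongly signaled event is not undone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State owned by the misbehaving driver. */
typedef struct {
    int queue_depth;
    int event_signaled;    /* stand-in for a dispatcher object */
} driver_state;

/* Another component that acts when it sees the event signaled. */
typedef struct {
    int requests_started;
} other_driver;

/* Simulated operation: signals the event too early, the peer reacts,
   and only then does the operation fail. */
bool faulty_operation(driver_state *s, other_driver *peer) {
    s->queue_depth += 1;
    s->event_signaled = 1;
    peer->requests_started += 1;   /* dependent work outside the checkpoint */
    return false;                  /* simulated failure */
}

/* Take a checkpoint of the driver state, run the operation, and roll
   back on failure. Only what was recorded can be restored. */
bool run_with_checkpoint(driver_state *s, other_driver *peer) {
    assert(s != NULL && peer != NULL);
    driver_state saved = *s;       /* the checkpoint */
    bool ok = faulty_operation(s, peer);
    if (!ok)
        *s = saved;                /* the peer's state is not covered */
    return ok;
}
```

After a failed run the driver looks untouched while the peer has already started a request. A real scheme would have to track and reverse every such dependency.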

    The OS and the processor do nothing to record what they would need to
    restore a system; assuming they do is a fallacy.
    Don Burn, Jan 16, 2006
  5. darkside

    darkside Guest

    OK, thanks for the helpful input. I understand those are other ways to keep
    the system safe when a driver has a problem, and they make sense. But could
    you point out specifically what the problem with my approach is? It aims to
    hold the problem thread for a while so that the whole system stays alive a
    bit longer (other healthy threads will get control). The system may no
    longer be safe because the root cause is still there, but the user can
    save their work or do other rescue work during this precious time.
    darkside, Jan 16, 2006
  6. Don Burn

    Don Burn Guest

    Assuming you can make it work at all, you are catching a failure only after
    the driver has done something like write to memory that does not exist. But
    when a driver goes bad it is likely to have had multiple failures, so how
    do you know it did not overwrite a pointer being passed to the disk driver,
    causing that driver to wipe out part of the disk structure, or cause some
    other similar disaster? Then instead of a reboot, you have lost your
    whole drive and all the data since your last backup. This is why systems
    like Windows "fail safe": they fail early enough that really bad things
    cannot happen. You are bypassing the fail-safes of the operating system.
    Don Burn, Jan 16, 2006
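
The objection in the last post can be shown with a small simulation. It is a sketch under stated assumptions: a 16-byte array stands in for shared kernel memory, the first 8 bytes belong to the buggy driver, and the last 8 belong to a disk driver. All names (`sim_system`, `buggy_driver_run`, `park_faulting_thread`, `disk_field_intact`) are hypothetical. The driver overruns its buffer and only afterwards "faults", so parking the thread happens too late to protect the neighbor's data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { DRIVER_BUF_OFF = 0, DRIVER_BUF_LEN = 8, DISK_FIELD_OFF = 8, MEM_SIZE = 16 };

typedef struct {
    unsigned char mem[MEM_SIZE];   /* stand-in for shared kernel memory */
    bool driver_thread_parked;
} sim_system;

/* Zero the driver buffer and fill the disk driver's region with 0xAA. */
void sim_init(sim_system *sys) {
    memset(sys->mem, 0, MEM_SIZE);
    memset(sys->mem + DISK_FIELD_OFF, 0xAA, MEM_SIZE - DISK_FIELD_OFF);
    sys->driver_thread_parked = false;
}

/* Buggy driver: writes 12 bytes into its 8-byte buffer, spilling into
   the disk driver's region, and then reports a (simulated) fault. */
bool buggy_driver_run(sim_system *sys) {
    size_t bad_len = DRIVER_BUF_LEN + 4;
    assert(bad_len <= MEM_SIZE);   /* keep the simulation itself in bounds */
    memset(sys->mem + DRIVER_BUF_OFF, 0x00, bad_len);
    return false;
}

/* What the proposed hook would do: hold the faulting thread. */
void park_faulting_thread(sim_system *sys) {
    sys->driver_thread_parked = true;
}

/* True if the disk driver's region still holds its original pattern. */
bool disk_field_intact(const sim_system *sys) {
    for (size_t i = DISK_FIELD_OFF; i < MEM_SIZE; i++)
        if (sys->mem[i] != 0xAA)
            return false;
    return true;
}
```

Calling `sim_init`, then `buggy_driver_run`, then `park_faulting_thread` leaves `disk_field_intact` returning false: the thread is held, yet the damage is already in memory for other code to act on.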