GPU hang on Cherry Trail #7

Closed
johalun opened this issue May 26, 2016 · 32 comments
@johalun
Member

johalun commented May 26, 2016

[image: img_5703]

@nomadlogic

Hey there, were you able to capture a core from this panic? If so, it may be helpful to post the full backtrace in this issue.

@johalun
Member Author

johalun commented May 26, 2016

Sorry, no core. A second after this output the system reboots automatically, and no core or anything else remains.

@johalun
Member Author

johalun commented May 26, 2016

I could get a core after all. It seems you need a swap partition for that. Since I run from USB memory, I had deactivated swap.

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:221
#1  0xffffffff80a409e5 in kern_reboot (howto=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a40fbb in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_shutdown.c:767
#3  0xffffffff80a41003 in panic (fmt=0x0) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_shutdown.c:690
#4  0xffffffff80eaf461 in trap_fatal (frame=0xfffffe0048f5d0d0, eva=16) at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/trap.c:841
#5  0xffffffff80eaf66d in trap_pfault (frame=0xfffffe0048f5d0d0, usermode=0) at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/trap.c:691
#6  0xffffffff80eaeb54 in trap (frame=0xfffffe0048f5d0d0) at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/trap.c:442
#7  0xffffffff80e8ef31 in calltrap () at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/exception.S:236
#8  0xffffffff82ba1b19 in pci_dev_put (pdev=0x0) at /home/mirama/dev/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:386
#9  0xffffffff82a0ab37 in intel_detect_pch (dev=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_drv.c:522
#10 0xffffffff82a0913c in i915_driver_load (dev=0xfffff80006903000, flags=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_dma.c:1048
#11 0xffffffff82b38555 in drm_dev_register (dev=0xfffff80006903000, flags=18446744071606949548)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_drv.c:785
#12 0xffffffff82b518f9 in drm_get_pci_dev (pdev=0xfffff80003a1f000, ent=0xffffffff82acc5d0, driver=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_pci.c:323
#13 0xffffffff82ba1f83 in linux_pci_attach (dev=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:193
#14 0xffffffff80a750f0 in device_attach (dev=0xfffff8000395d000) at device_if.h:180
#15 0xffffffff80a767d6 in bus_generic_driver_added (dev=<value optimized out>, driver=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/kern/subr_bus.c:2858
#16 0xffffffff80a72abd in devclass_driver_added (dc=<value optimized out>, driver=<value optimized out>) at bus_if.h:204
#17 0xffffffff80a729e1 in devclass_add_driver (dc=<value optimized out>, driver=<value optimized out>, pass=<value optimized out>, dcp=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/kern/subr_bus.c:1172
#18 0xffffffff82ba1816 in pci_register_driver (pdrv=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:297
#19 0xffffffff82a0cd7c in _module_run (arg=<value optimized out>) at module.h:80
#20 0xffffffff80a14478 in linker_load_module (kldname=<value optimized out>, modname=0xfffff80003ed7800 "i915kms", parent=<value optimized out>, 
    verinfo=<value optimized out>, lfpp=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_linker.c:230
#21 0xffffffff80a15ad7 in kern_kldload (td=<value optimized out>, file=<value optimized out>, fileid=0xfffffe0048f5dac4)
    at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_linker.c:1037
#22 0xffffffff80a15b9b in sys_kldload (td=0xfffff8000651e000, uap=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_linker.c:1063
#23 0xffffffff80eafc1b in amd64_syscall (td=0xfffff8000651e000, traced=0) at subr_syscall.c:135
#24 0xffffffff80e8f21b in Xfast_syscall () at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/exception.S:396
#25 0x000000080086d12a in ?? ()

@mattmacy
Member

Please try the latest.

@mattmacy mattmacy self-assigned this May 28, 2016
@johalun
Member Author

johalun commented May 28, 2016

Got a bit further this time.

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:221
#1  0xffffffff80a40b85 in kern_reboot (howto=<value optimized out>) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a4115b in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_shutdown.c:767
#3  0xffffffff80a411a3 in panic (fmt=0x0) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_shutdown.c:690
#4  0xffffffff80eaf401 in trap_fatal (frame=0xfffffe0048ef0fb0, eva=0) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/trap.c:841
#5  0xffffffff80eaf090 in trap (frame=0xfffffe0048ef0fb0) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/trap.c:203
#6  0xffffffff80e8f881 in calltrap () at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/exception.S:236
#7  0xffffffff82b24ef0 in drm_clflush_virt_range (addr=0xfffff8000a70b000, length=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_cache.c:139
#8  0xffffffff82a21271 in __hw_ppgtt_init (dev=<value optimized out>, ppgtt=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_gtt.c:362
#9  0xffffffff82a21cac in i915_ppgtt_create (dev=0xfffff80003f9c000, fpriv=0x0)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_gtt.c:2159
#10 0xffffffff82a19978 in i915_gem_create_context (dev=<value optimized out>, file_priv=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_context.c:299
#11 0xffffffff82a19696 in i915_gem_context_init (dev=0xfffff80003f9c000)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_context.c:391
#12 0xffffffff82a16cc0 in i915_gem_init (dev=0xfffff80003f9c000)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem.c:5039
#13 0xffffffff82a09b25 in i915_driver_load (dev=<value optimized out>, flags=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_dma.c:414
#14 0xffffffff82b3b4d5 in drm_dev_register (dev=0xfffff80003f9c000, flags=18446744071606956220)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_drv.c:785
#15 0xffffffff82b55de9 in drm_prime_pages_to_sg (pages=<value optimized out>, nr_pages=<value optimized out>) at scatterlist.h:110
#16 0xffffffff82ba7839 in linux_pci_attach (dev=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:210
#17 0xffffffff80a75320 in device_attach (dev=0xfffff8000393a600) at device_if.h:180
#18 0xffffffff80a76a06 in bus_generic_driver_added (dev=<value optimized out>, driver=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/subr_bus.c:2858
#19 0xffffffff80a72ced in devclass_driver_added (dc=<value optimized out>, driver=<value optimized out>) at bus_if.h:204
#20 0xffffffff80a72c11 in devclass_add_driver (dc=<value optimized out>, driver=<value optimized out>, pass=<value optimized out>, dcp=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/subr_bus.c:1172
#21 0xffffffff82ba704f in pci_register_driver (pdrv=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:327
#22 0xffffffff82a0cd2c in _module_run (arg=<value optimized out>) at module.h:80
#23 0xffffffff80a14618 in linker_load_module (kldname=<value optimized out>, modname=0xfffff800039f3000 "i915kms", parent=<value optimized out>, 
    verinfo=<value optimized out>, lfpp=<value optimized out>) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_linker.c:230
#24 0xffffffff80a15c77 in kern_kldload (td=<value optimized out>, file=<value optimized out>, fileid=0xfffffe0048ef1ac4)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_linker.c:1037
#25 0xffffffff80a15d3b in sys_kldload (td=0xfffff80003ef0000, uap=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_linker.c:1063
#26 0xffffffff80eafbbb in amd64_syscall (td=0xfffff80003ef0000, traced=0) at subr_syscall.c:135
#27 0xffffffff80e8fb6b in Xfast_syscall () at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/exception.S:396
#28 0x000000080086d12a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal

@mattmacy
Member

It looks like it may be trying to use clflushopt on an unsupported processor. Can you try the latest and see if it works? If not I'll just use clflush.

Thanks.
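
As a rough illustration of the kind of check being described (a sketch, not the driver's actual code): CLFLUSHOPT support is advertised in CPUID leaf 7, subleaf 0, EBX bit 23, while plain CLFLUSH is advertised in CPUID leaf 1, EDX bit 19. The helper and enum names below are hypothetical, and the register values are passed in as plain integers so the selection logic can be exercised without executing CPUID.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the Intel SDM description of CPUID. */
#define CPUID_1_EDX_CLFSH      (1u << 19) /* CLFLUSH available */
#define CPUID_7_EBX_CLFLUSHOPT (1u << 23) /* CLFLUSHOPT available */

enum flush_insn { FLUSH_NONE, FLUSH_CLFLUSH, FLUSH_CLFLUSHOPT };

/*
 * Hypothetical helper: choose the best cache-line flush instruction the
 * CPU advertises.  leaf1_edx is EDX from CPUID leaf 1; leaf7_ebx is EBX
 * from CPUID leaf 7, subleaf 0 (pass 0 if leaf 7 is not supported).
 */
static enum flush_insn
pick_flush_insn(uint32_t leaf1_edx, uint32_t leaf7_ebx)
{
	if (leaf7_ebx & CPUID_7_EBX_CLFLUSHOPT)
		return FLUSH_CLFLUSHOPT;
	if (leaf1_edx & CPUID_1_EDX_CLFSH)
		return FLUSH_CLFLUSH;
	return FLUSH_NONE;
}

/* Usage: a CPU with CLFLUSH but no CLFLUSHOPT falls back to CLFLUSH. */
void
flush_insn_example(void)
{
	assert(pick_flush_insn(CPUID_1_EDX_CLFSH, 0) == FLUSH_CLFLUSH);
}
```

Executing an instruction the CPU does not advertise raises an invalid-opcode fault, so the feature bit has to be tested before the instruction is chosen.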

@johalun
Member Author

johalun commented May 30, 2016

Getting closer :)

[image: fullsizerender 2]

@mattmacy
Member

Hardly shippable, but this is definitely progress.

@mattmacy mattmacy changed the title from "Kernel panic on Cherry Trail when kldload i915kms" to "GPU hang on Cherry Trail" May 30, 2016
@johalun
Member Author

johalun commented May 30, 2016

It actually renders something when I start X, but the screen content is all messed up, like tiles repeating in X and Y.

@mattmacy
Member

What does the log show if you set dev.drm.drm_debug=-1 (after loading i915kms) before starting X?

@johalun
Member Author

johalun commented May 30, 2016

Here's the log. It covers boot, kldload i915kms, and starting and stopping X a couple of times.

messages.txt.zip

@mattmacy
Member

This is returning EIO. I'll have to dig in to which path is doing that.

/* Throttle our rendering by waiting until the ring has completed our requests
 * emitted over 20 msec ago.
 *
 * Note that if we were to use the current jiffies each time around the loop,
 * we wouldn't escape the function with any frames outstanding if the time to
 * render a frame was over 20ms.
 *
 * This should get us reasonable parallelism between CPU and GPU but also
 * relatively low latency when blocking on a particular request to finish.
 */
static int
i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
{
    struct drm_i915_private *dev_priv = dev->dev_private;
    struct drm_i915_file_private *file_priv = file->driver_priv;
    unsigned long recent_enough = jiffies - DRM_I915_THROTTLE_JIFFIES;
    struct drm_i915_gem_request *request, *target = NULL;
    unsigned reset_counter;
    int ret;

    ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
    if (ret)
        return ret;

    ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
    if (ret)
        return ret;

    spin_lock(&file_priv->mm.lock);
    list_for_each_entry(request, &file_priv->mm.request_list, client_list) {
        if (time_after_eq(request->emitted_jiffies, recent_enough))
            break;

        /*
         * Note that the request might not have been submitted yet.
         * In which case emitted_jiffies will be zero.
         */
        if (!request->emitted_jiffies)
            continue;

        target = request;
    }
    reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
    if (target)
        i915_gem_request_reference(target);
    spin_unlock(&file_priv->mm.lock);

    if (target == NULL)
        return 0;

    ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
    if (ret == 0)
        queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);

    i915_gem_request_unreference__unlocked(target);

    return ret;
}
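
To make the selection loop above easier to follow, here is a small userspace model of it. This is a sketch: `pick_throttle_target` and the array standing in for the request list are hypothetical, and it assumes the usual Linux definition of `time_after_eq(a, b)` as a signed comparison of `a - b`, which keeps the test correct when the jiffies counter wraps.

```c
#include <assert.h>
#include <stddef.h>

/* Wraparound-safe "a is at or after b", modeled on Linux's time_after_eq(). */
static int
time_after_eq_ul(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Walk requests in submission order and return the index of the newest
 * request emitted before (now - throttle), or -1 if there is none.
 * An emitted value of 0 means "not submitted yet" and is skipped.
 */
static int
pick_throttle_target(const unsigned long *emitted, size_t n,
    unsigned long now, unsigned long throttle)
{
	unsigned long recent_enough = now - throttle;
	int target = -1;

	assert(emitted != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* Same order of checks as the kernel loop. */
		if (time_after_eq_ul(emitted[i], recent_enough))
			break;
		if (emitted[i] == 0)
			continue;
		target = (int)i;
	}
	return target;
}
```

With requests emitted at ticks 100, 110, 125 and 130, `now = 140` and a 20-tick window, the function picks index 1: the request at 110 is the newest one older than the window, and that is the request the caller would wait on.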

@johalun
Member Author

johalun commented May 30, 2016

I will add some printf's and test.

@johalun
Member Author

johalun commented May 30, 2016

OK. Many of the returns come from
ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);

@mattmacy
Member

Which one though?

int
i915_gem_check_wedge(struct i915_gpu_error *error,
             bool interruptible)
{
    if (i915_reset_in_progress(error)) {
        /* Non-interruptible callers can't handle -EAGAIN, hence return
         * -EIO unconditionally for these. */
        if (!interruptible)
            return -EIO;

        /* Recovery complete, but the reset failed ... */
        if (i915_terminally_wedged(error))
            return -EIO;

        /*
         * Check if GPU Reset is in progress - we need intel_ring_begin
         * to work properly to reinit the hw state while the gpu is
         * still marked as reset-in-progress. Handle this with a flag.
         */
        if (!error->reload_in_reset)
            return -EAGAIN;
    }

    return 0;
}
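
One way to answer "which one" without guessing is to tag each exit. The sketch below is a userspace model of the function above, not driver code: the flag values for the reset counter are assumptions based on the upstream Linux i915 headers of that period, and the `enum wedge_path` tags are hypothetical. Note that `i915_gem_ring_throttle()` passes `interruptible = false`, so for that caller any reset in progress leaves through the first `-EIO` exit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed flag layout of gpu_error.reset_counter (verify in i915_drv.h). */
#define RESET_IN_PROGRESS_FLAG 1u
#define WEDGED                 (1u << 31)

enum wedge_path { PATH_OK, PATH_EIO_NONINTR, PATH_EIO_WEDGED, PATH_EAGAIN };

struct wedge_result {
	int ret;              /* what the real function would return */
	enum wedge_path path; /* which exit produced it */
};

static struct wedge_result
check_wedge_model(unsigned reset_counter, bool reload_in_reset,
    bool interruptible)
{
	bool in_progress =
	    (reset_counter & (RESET_IN_PROGRESS_FLAG | WEDGED)) != 0;
	bool wedged = (reset_counter & WEDGED) != 0;

	if (in_progress) {
		if (!interruptible)
			return (struct wedge_result){ -EIO, PATH_EIO_NONINTR };
		if (wedged)
			return (struct wedge_result){ -EIO, PATH_EIO_WEDGED };
		if (!reload_in_reset)
			return (struct wedge_result){ -EAGAIN, PATH_EAGAIN };
	}
	return (struct wedge_result){ 0, PATH_OK };
}

/* Usage: the throttle path (interruptible == false) during a reset. */
void
check_wedge_example(void)
{
	struct wedge_result r =
	    check_wedge_model(RESET_IN_PROGRESS_FLAG, false, false);

	assert(r.ret == -EIO && r.path == PATH_EIO_NONINTR);
}
```

In the driver itself the equivalent would be a distinct debug print at each `return`, so the log shows which branch fired.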

@mattmacy
Member

/**
 * i915_reset_and_wakeup - do process context error handling work
 * @dev: drm device
 *
 * Fire an error uevent so userspace can see that a hang or error
 * was detected.
 */
static void i915_reset_and_wakeup(struct drm_device *dev)
{
    struct drm_i915_private *dev_priv = to_i915(dev);
    struct i915_gpu_error *error = &dev_priv->gpu_error;
    char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
    char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
    char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
    int ret;

    kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);

    /*
     * Note that there's only one work item which does gpu resets, so we
     * need not worry about concurrent gpu resets potentially incrementing
     * error->reset_counter twice. We only need to take care of another
     * racing irq/hangcheck declaring the gpu dead for a second time. A
     * quick check for that is good enough: schedule_work ensures the
     * correct ordering between hang detection and this work item, and since
     * the reset in-progress bit is only ever set by code outside of this
     * work we don't need to worry about any other races.
     */
    if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
        DRM_DEBUG_DRIVER("resetting chip\n");
        kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
                   reset_event);

        /*
         * In most cases it's guaranteed that we get here with an RPM
         * reference held, for example because there is a pending GPU
         * request that won't finish until the reset is done. This
         * isn't the case at least when we get here by doing a
         * simulated reset via debugs, so get an RPM reference.
         */
        intel_runtime_pm_get(dev_priv);

        intel_prepare_reset(dev);

        /*
         * All state reset _must_ be completed before we update the
         * reset counter, for otherwise waiters might miss the reset
         * pending state and not properly drop locks, resulting in
         * deadlocks with the reset work.
         */
        ret = i915_reset(dev); /* <---- THIS IS RETURNING NON-ZERO */

        intel_finish_reset(dev);

        intel_runtime_pm_put(dev_priv);

        if (ret == 0) {
            /*
             * After all the gem state is reset, increment the reset
             * counter and wake up everyone waiting for the reset to
             * complete.
             *
             * Since unlock operations are a one-sided barrier only,
             * we need to insert a barrier here to order any seqno
             * updates before
             * the counter increment.
             */
            smp_mb__before_atomic();
            atomic_inc(&dev_priv->gpu_error.reset_counter);

            kobject_uevent_env(&dev->primary->kdev->kobj,
                       KOBJ_CHANGE, reset_done_event);
        } else {
            atomic_or(I915_WEDGED, &error->reset_counter); /* <---- WHICH GETS US HERE */
        }

        /*
         * Note: The wake_up also serves as a memory barrier so that
         * waiters see the update value of the reset counter atomic_t.
         */
        i915_error_wake_up(dev_priv, true);
    }
}

@mattmacy
Member

/**
 * i915_reset - reset chip after a hang
 * @dev: drm device to reset
 *
 * Reset the chip.  Useful if a hang is detected. Returns zero on successful
 * reset or otherwise an error code.
 *
 * Procedure is fairly simple:
 *   - reset the chip using the reset reg
 *   - re-init context state
 *   - re-init hardware status page
 *   - re-init ring buffer
 *   - re-init interrupt state
 *   - re-init display
 */
int i915_reset(struct drm_device *dev)
{
    struct drm_i915_private *dev_priv = dev->dev_private;
    bool simulated;
    int ret;

    intel_reset_gt_powersave(dev);

    mutex_lock(&dev->struct_mutex);

    i915_gem_reset(dev);

    simulated = dev_priv->gpu_error.stop_rings != 0;

    ret = intel_gpu_reset(dev); /* <---- EITHER THIS FAILS */

    /* Also reset the gpu hangman. */
    if (simulated) {
        DRM_INFO("Simulated gpu hang, resetting stop_rings\n");
        dev_priv->gpu_error.stop_rings = 0;
        if (ret == -ENODEV) {
            DRM_INFO("Reset not implemented, but ignoring "
                 "error for simulated gpu hangs\n");
            ret = 0;
        }
    }

    if (i915_stop_ring_allow_warn(dev_priv))
        pr_notice("drm/i915: Resetting chip after gpu hang\n");

    if (ret) {
        DRM_ERROR("Failed to reset chip: %i\n", ret);
        mutex_unlock(&dev->struct_mutex);
        return ret;
    }

    intel_overlay_reset(dev_priv);

    /* Ok, now get things going again... */

    /*
     * Everything depends on having the GTT running, so we need to start
     * there.  Fortunately we don't need to do this unless we reset the
     * chip at a PCI level.
     *
     * Next we need to restore the context, but we don't use those
     * yet either...
     *
     * Ring buffer needs to be re-initialized in the KMS case, or if X
     * was running at the time of the reset (i.e. we weren't VT
     * switched away).
     */

    /* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
    dev_priv->gpu_error.reload_in_reset = true;

    ret = i915_gem_init_hw(dev); /* <---- OR THIS FAILS */

    dev_priv->gpu_error.reload_in_reset = false;

    mutex_unlock(&dev->struct_mutex);
    if (ret) {
        DRM_ERROR("Failed hw init on reset %d\n", ret);
        return ret;
    }

@mattmacy
Member

Maybe try and instrument this to make sure we're actually waiting for 500ms here?

static int gen6_do_reset(struct drm_device *dev)
{
    struct drm_i915_private *dev_priv = dev->dev_private;
    int ret;

    /* Reset the chip */

    /* GEN6_GDRST is not in the gt power well, no need to check
     * for fifo space for the write or forcewake the chip for
     * the read
     */
    __raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_FULL);

    /* Spin waiting for the device to ack the reset request */
    ret = wait_for((__raw_i915_read32(dev_priv, GEN6_GDRST) & GEN6_GRDOM_FULL) == 0, 500); /* <----- THIS MAY BE DODGY */

    intel_uncore_forcewake_reset(dev, true);

    return ret;
}
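
A userspace sketch of that instrumentation idea: poll a condition until it clears or a deadline passes, and report how long the loop really ran. This is not the i915 `wait_for()` macro; `poll_until_clear`, the callback type, and the two fake registers are hypothetical names, and `clock_gettime(CLOCK_MONOTONIC)` stands in for a kernel uptime clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*read_reg_fn)(void);

/* Monotonic time in whole milliseconds. */
static int64_t
now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll until (rd() & mask) == 0 or timeout_ms elapses.
 * Returns 0 on success or -ETIMEDOUT; *waited_ms gets the measured time.
 */
static int
poll_until_clear(read_reg_fn rd, uint32_t mask, int timeout_ms,
    int64_t *waited_ms)
{
	const struct timespec nap = { 0, 1000000 }; /* sleep 1 ms per poll */
	int64_t start = now_ms();
	int ret = -ETIMEDOUT;

	assert(rd != NULL && waited_ms != NULL);

	for (;;) {
		if ((rd() & mask) == 0) {
			ret = 0;
			break;
		}
		if (now_ms() - start >= timeout_ms)
			break;
		nanosleep(&nap, NULL);
	}
	*waited_ms = now_ms() - start;
	return ret;
}

/* A register that never acknowledges, and one that is already clear. */
static uint32_t fake_stuck_reg(void) { return 1; }
static uint32_t fake_idle_reg(void) { return 0; }
```

If the measured time for a timed-out wait comes out far below the requested timeout, the delay primitive behind the loop would be the first thing to check.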

@johalun
Member Author

johalun commented May 30, 2016

I get "reset in progress". What API can I use to time a function in a kernel driver?

@mattmacy
Member

from sys/time.h:

/*
 * Functions for looking at our clock: [get]{bin,nano,micro}[up]time()
 *
 * Functions without the "get" prefix returns the best timestamp
 * we can produce in the given format.
 *
 * "bin"   == struct bintime  == seconds + 64 bit fraction of seconds.
 * "nano"  == struct timespec == seconds + nanoseconds.
 * "micro" == struct timeval  == seconds + microseconds.
 *
 * Functions containing "up" returns time relative to boot and
 * should be used for calculating time intervals.
 *
 * Functions without "up" returns UTC time.
 *
 * Functions with the "get" prefix returns a less precise result
 * much faster than the functions without "get" prefix and should
 * be used where a precision of 1/hz seconds is acceptable or where
 * performance is priority. (NB: "precision", _not_ "resolution" !)
 */

void    binuptime(struct bintime *bt);
void    nanouptime(struct timespec *tsp);
void    microuptime(struct timeval *tvp);

This is purely for instrumentation so I think the added overhead of the extra resolution is ok.
"microuptime" sounds like the way to go.
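
A minimal sketch of that kind of instrumentation. In the kernel the two timestamps would come from `microuptime()`; here `elapsed_usec` is a hypothetical helper and the `struct timeval` values are filled in by hand so the arithmetic can be checked in userspace.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds between two uptime stamps; end must not precede start. */
static int64_t
elapsed_usec(const struct timeval *start, const struct timeval *end)
{
	int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
	int64_t usec = (int64_t)end->tv_usec - (int64_t)start->tv_usec;
	int64_t total = sec * 1000000 + usec;

	assert(total >= 0); /* uptime is monotonic, so this should hold */
	return total;
}
```

In the driver the pattern would be `microuptime(&t0);`, then the code under test, then `microuptime(&t1);`, followed by a printf of the difference.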

@johalun
Member Author

johalun commented May 30, 2016

I don't get any output at all from gen6_do_reset()...

@johalun
Member Author

johalun commented May 30, 2016

Getting -5 from intel_gpu_reset(). Btw, isn't cherryview gen8? Have to stop now but can keep digging tomorrow.

@mattmacy
Member

Try sticking a BACKTRACE() in each of the reset functions to see which is getting called.

@mattmacy
Member

gen6_do_reset is called from gen8_do_reset

@mattmacy
Member

I'm on #freebsd-xorg on EFnet much of the time. Easier to discuss in real-time.

@johalun
Member Author

johalun commented May 30, 2016

The reset request times out, so gen6_do_reset never gets called.

[drm:cherryview_enable_rps] setting GPU freq to 400 MHz (40)
[drm] stuck on render ring
[drm] stuck on blitter ring
[drm] stuck on bsd ring
[drm] stuck on video enhancement ring
[drm] GPU HANG: ecode 8:0:0x00201001, reason: Ring hung, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
[drm:i915_reset_and_wakeup] resetting chip
[drm:0xffffffff82a2b4bds] *ERROR* gpu hanging too fast, banning!
intel_gpu_reset() at intel_gpu_reset+0x1e/frame 0xfffffe004996c790
i915_reset() at i915_reset+0x6d/frame 0xfffffe004996c800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe004996c890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe004996c990
i915_hangcheck_elapsed() at i915_hangcheck_elapsed+0x49f/frame 0xfffffe004996ca60
gen8_do_reset() at gen8_do_reset+0x1e/frame 0xfffffe004996c740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe004996c790
i915_reset() at i915_reset+0x6d/frame 0xfffffe004996c800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe004996c890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe004996c990
[drm:0xffffffff82ba00ccs] *ERROR* blitter ring: reset request timeout
i915_reset: intel_gpu_reset returns -5
drm/i915: Resetting chip after gpu hang
[drm:0xffffffff82a1511ds] *ERROR* Failed to reset chip: -5

@mattmacy
Member

Where though?

@johalun
Member Author

johalun commented May 30, 2016

Not sure what this means, but the reset fails at the second ring.

[drm:i915_reset_and_wakeup] resetting chip
[drm:0xffffffff82a2b4bds] *ERROR* gpu hanging too fast, banning!
intel_gpu_reset() at intel_gpu_reset+0x1e/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe0049976990
i915_hangcheck_elapsed() at i915_hangcheck_elapsed+0x49f/frame 0xfffffe0049976a60
gen8_do_reset() at gen8_do_reset+0x21/frame 0xfffffe0049976740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe0049976990
gen8_do_reset: for each ring i=0. dev=0xfffffe00016b2000. engine=0xfffffe00016b38d8.
wait_for_register() at wait_for_register+0x23/frame 0xfffffe00499766b0
gen8_do_reset() at gen8_do_reset+0x132/frame 0xfffffe0049976740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
gen8_do_reset: for each ring i=1. dev=0xfffffe00016b2000. engine=0xfffffe00016b4b78.
wait_for_register() at wait_for_register+0x23/frame 0xfffffe00499766b0
gen8_do_reset() at gen8_do_reset+0x132/frame 0xfffffe0049976740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
[drm:0xffffffff82ba0104s] *ERROR* blitter ring: reset request timeout
gen8_do_reset: for each ring do reset (not ready) i=0. dev=0xfffffe00016b2000. engine=0xfffffe00016b38d8.
gen8_do_reset: for each ring do reset (not ready) i=1. dev=0xfffffe00016b2000. engine=0xfffffe00016b4b78.
gen8_do_reset: for each ring do reset (not ready) i=2. dev=0xfffffe00016b2000. engine=0xfffffe00016b5e18.
gen8_do_reset: for each ring do reset (not ready) i=4. dev=0xfffffe00016b2000. engine=0xfffffe00016b8358.
i915_reset: intel_gpu_reset returns -5
drm/i915: Resetting chip after gpu hang
[drm:0xffffffff82a1511ds] *ERROR* Failed to reset chip: -5
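
The log is consistent with a two-phase reset: ask every engine to get ready, and go on to the full reset only if all of them acknowledge; otherwise withdraw the requests and fail. The model below is a userspace sketch of that control flow as inferred from the trace above and from the upstream Linux `gen8_do_reset()` of that era. It is an assumption, not the code in this tree, and `struct fake_engine`, `request_reset_all`, and `counting_full_reset` are hypothetical names. It shows how one engine that never becomes ready yields `-EIO` (the -5 in the log) without the full reset ever running.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_engine {
	const char *name;
	bool will_ack;   /* does the engine ever report ready-to-reset? */
	bool requested;  /* is a reset request currently pending? */
};

/*
 * Phase 1: request a reset on each engine and wait for its ack.
 * Phase 2: if every engine acked, run full_reset(); otherwise cancel
 * all requests and return -EIO without calling full_reset().
 */
static int
request_reset_all(struct fake_engine *engines, size_t n,
    int (*full_reset)(void))
{
	assert(engines != NULL && full_reset != NULL);

	for (size_t i = 0; i < n; i++) {
		engines[i].requested = true;
		if (!engines[i].will_ack)
			goto not_ready; /* "reset request timeout" */
	}
	return full_reset();

not_ready:
	for (size_t i = 0; i < n; i++)
		engines[i].requested = false;
	return -EIO;
}

/* Stand-in for the full-chip reset; counts how often it is reached. */
static int full_reset_calls;

static int
counting_full_reset(void)
{
	full_reset_calls++;
	return 0;
}
```

With a "blitter" entry whose `will_ack` is false, the call returns `-EIO` and `full_reset_calls` stays at zero, which matches the observation that `gen6_do_reset()` produced no output.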

@johalun
Member Author

johalun commented May 30, 2016

I'm on IRC btw.

@mattmacy
Member

mattmacy commented May 30, 2016

Nick? I'm on #freebsd-xorg on EFnet.

@mattmacy
Member

mattmacy commented Jun 4, 2016

[image: img_0307_720]

Works for me!

@mattmacy
Member

mattmacy commented Jun 4, 2016

Please file a separate issue for any further problems.

@mattmacy mattmacy closed this as completed Jun 4, 2016
mattmacy pushed a commit that referenced this issue Jun 24, 2016
Drop scan generation number and node table scan lock - the only place
where ni_scangen is checked is in ieee80211_timeout_stations() (and it
is used to prevent duplicate checking of the same node); node scan lock
protects only this variable + node table scan generation number.

This will fix (at least) next LOR (hostap mode):

lock order reversal:
1st 0xc175f84c urtwm0_scan_loc (urtwm0_scan_loc) @ /usr/src/sys/modules/wlan/../../net80211/ieee80211_node.c:2019
2nd 0xc175e018 urtwm0_com_lock (urtwm0_com_lock) @ /usr/src/sys/modules/wlan/../../net80211/ieee80211_node.c:2693
stack backtrace:
#0 0xa070d1c5 at witness_debugger+0x75
#1 0xa070d0f6 at witness_checkorder+0xd46
#2 0xa0694cce at __mtx_lock_flags+0x9e
#3 0xb03ad9ef at ieee80211_node_leave+0x12f
#4 0xb03afd13 at ieee80211_timeout_stations+0x483
#5 0xb03aa1c2 at ieee80211_node_timeout+0x42
#6 0xa06c6fa1 at softclock_call_cc+0x1e1
#7 0xa06c7518 at softclock+0xc8
#8 0xa06789ae at intr_event_execute_handlers+0x8e
#9 0xa0678fa0 at ithread_loop+0x90
#10 0xa0675fbe at fork_exit+0x7e
#11 0xa08af910 at fork_trampoline+0x8

In addition to the above:

* switch to ieee80211_iterate_nodes();
* do not assert that node table lock is held, while calling node_age();
  that's not really needed (there are no resources, which can be protected
  by this lock) + this fixes LOR/deadlock between ieee80211_timeout_stations()
  and ieee80211_set_tim() (easy to reproduce in HOSTAP mode while
  sending something to an STA with enabled power management).

Tested:

* (avos) urtwn0, hostap mode
* (adrian) AR9380, STA mode
* (adrian) AR9380, AR9331, AR9580, hostap mode

Notes:

* This changes the net80211 internals, so you have to recompile all of it
  and the wifi drivers.

Submitted by:	avos
Approved by:	re (delphij)
Differential Revision:	https://reviews.freebsd.org/D6833
mjoras pushed a commit to mjoras/freebsd-base-graphics that referenced this issue Jun 16, 2017