GPU hang on Cherry Trail #7

johalun · 2016-05-26T00:35:43Z

nomadlogic · 2016-05-26T02:31:01Z

Hey there, were you able to capture a core from this panic? If so, it may be helpful to post the full backtrace in this issue.

johalun · 2016-05-26T05:02:40Z

Sorry, no core. A second after this output the system automatically reboots, and no core or anything else is left.

johalun · 2016-05-26T21:49:30Z

I managed to get a core. It seems you need a swap partition for that; since I run from a USB memory stick, I had deactivated swap.

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:221
#1  0xffffffff80a409e5 in kern_reboot (howto=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a40fbb in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_shutdown.c:767
#3  0xffffffff80a41003 in panic (fmt=0x0) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_shutdown.c:690
#4  0xffffffff80eaf461 in trap_fatal (frame=0xfffffe0048f5d0d0, eva=16) at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/trap.c:841
#5  0xffffffff80eaf66d in trap_pfault (frame=0xfffffe0048f5d0d0, usermode=0) at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/trap.c:691
#6  0xffffffff80eaeb54 in trap (frame=0xfffffe0048f5d0d0) at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/trap.c:442
#7  0xffffffff80e8ef31 in calltrap () at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/exception.S:236
#8  0xffffffff82ba1b19 in pci_dev_put (pdev=0x0) at /home/mirama/dev/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:386
#9  0xffffffff82a0ab37 in intel_detect_pch (dev=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_drv.c:522
#10 0xffffffff82a0913c in i915_driver_load (dev=0xfffff80006903000, flags=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_dma.c:1048
#11 0xffffffff82b38555 in drm_dev_register (dev=0xfffff80006903000, flags=18446744071606949548)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_drv.c:785
#12 0xffffffff82b518f9 in drm_get_pci_dev (pdev=0xfffff80003a1f000, ent=0xffffffff82acc5d0, driver=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_pci.c:323
#13 0xffffffff82ba1f83 in linux_pci_attach (dev=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:193
#14 0xffffffff80a750f0 in device_attach (dev=0xfffff8000395d000) at device_if.h:180
#15 0xffffffff80a767d6 in bus_generic_driver_added (dev=<value optimized out>, driver=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/kern/subr_bus.c:2858
#16 0xffffffff80a72abd in devclass_driver_added (dc=<value optimized out>, driver=<value optimized out>) at bus_if.h:204
#17 0xffffffff80a729e1 in devclass_add_driver (dc=<value optimized out>, driver=<value optimized out>, pass=<value optimized out>, dcp=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/kern/subr_bus.c:1172
#18 0xffffffff82ba1816 in pci_register_driver (pdrv=<value optimized out>)
    at /home/mirama/dev/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:297
#19 0xffffffff82a0cd7c in _module_run (arg=<value optimized out>) at module.h:80
#20 0xffffffff80a14478 in linker_load_module (kldname=<value optimized out>, modname=0xfffff80003ed7800 "i915kms", parent=<value optimized out>, 
    verinfo=<value optimized out>, lfpp=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_linker.c:230
#21 0xffffffff80a15ad7 in kern_kldload (td=<value optimized out>, file=<value optimized out>, fileid=0xfffffe0048f5dac4)
    at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_linker.c:1037
#22 0xffffffff80a15b9b in sys_kldload (td=0xfffff8000651e000, uap=<value optimized out>) at /home/mirama/dev/freebsd-base-graphics/sys/kern/kern_linker.c:1063
#23 0xffffffff80eafc1b in amd64_syscall (td=0xfffff8000651e000, traced=0) at subr_syscall.c:135
#24 0xffffffff80e8f21b in Xfast_syscall () at /home/mirama/dev/freebsd-base-graphics/sys/amd64/amd64/exception.S:396
#25 0x000000080086d12a in ?? ()

mattmacy · 2016-05-28T06:30:55Z

Please try the latest.

johalun · 2016-05-28T15:55:54Z

Got a bit further this time.

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:221
#1  0xffffffff80a40b85 in kern_reboot (howto=<value optimized out>) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a4115b in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_shutdown.c:767
#3  0xffffffff80a411a3 in panic (fmt=0x0) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_shutdown.c:690
#4  0xffffffff80eaf401 in trap_fatal (frame=0xfffffe0048ef0fb0, eva=0) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/trap.c:841
#5  0xffffffff80eaf090 in trap (frame=0xfffffe0048ef0fb0) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/trap.c:203
#6  0xffffffff80e8f881 in calltrap () at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/exception.S:236
#7  0xffffffff82b24ef0 in drm_clflush_virt_range (addr=0xfffff8000a70b000, length=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_cache.c:139
#8  0xffffffff82a21271 in __hw_ppgtt_init (dev=<value optimized out>, ppgtt=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_gtt.c:362
#9  0xffffffff82a21cac in i915_ppgtt_create (dev=0xfffff80003f9c000, fpriv=0x0)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_gtt.c:2159
#10 0xffffffff82a19978 in i915_gem_create_context (dev=<value optimized out>, file_priv=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_context.c:299
#11 0xffffffff82a19696 in i915_gem_context_init (dev=0xfffff80003f9c000)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem_context.c:391
#12 0xffffffff82a16cc0 in i915_gem_init (dev=0xfffff80003f9c000)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_gem.c:5039
#13 0xffffffff82a09b25 in i915_driver_load (dev=<value optimized out>, flags=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/i915/i915kms/../../../../dev/drm2/i915/i915_dma.c:414
#14 0xffffffff82b3b4d5 in drm_dev_register (dev=0xfffff80003f9c000, flags=18446744071606956220)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/drm2/drm2/../../../dev/drm2/drm_drv.c:785
#15 0xffffffff82b55de9 in drm_prime_pages_to_sg (pages=<value optimized out>, nr_pages=<value optimized out>) at scatterlist.h:110
#16 0xffffffff82ba7839 in linux_pci_attach (dev=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:210
#17 0xffffffff80a75320 in device_attach (dev=0xfffff8000393a600) at device_if.h:180
#18 0xffffffff80a76a06 in bus_generic_driver_added (dev=<value optimized out>, driver=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/subr_bus.c:2858
#19 0xffffffff80a72ced in devclass_driver_added (dc=<value optimized out>, driver=<value optimized out>) at bus_if.h:204
#20 0xffffffff80a72c11 in devclass_add_driver (dc=<value optimized out>, driver=<value optimized out>, pass=<value optimized out>, dcp=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/subr_bus.c:1172
#21 0xffffffff82ba704f in pci_register_driver (pdrv=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/modules/linuxkpi/../../compat/linuxkpi/common/src/linux_pci.c:327
#22 0xffffffff82a0cd2c in _module_run (arg=<value optimized out>) at module.h:80
#23 0xffffffff80a14618 in linker_load_module (kldname=<value optimized out>, modname=0xfffff800039f3000 "i915kms", parent=<value optimized out>, 
    verinfo=<value optimized out>, lfpp=<value optimized out>) at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_linker.c:230
#24 0xffffffff80a15c77 in kern_kldload (td=<value optimized out>, file=<value optimized out>, fileid=0xfffffe0048ef1ac4)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_linker.c:1037
#25 0xffffffff80a15d3b in sys_kldload (td=0xfffff80003ef0000, uap=<value optimized out>)
    at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/kern/kern_linker.c:1063
#26 0xffffffff80eafbbb in amd64_syscall (td=0xfffff80003ef0000, traced=0) at subr_syscall.c:135
#27 0xffffffff80e8fb6b in Xfast_syscall () at /usr/home/johannes/dev/freebsd/freebsd-base-graphics/sys/amd64/amd64/exception.S:396
#28 0x000000080086d12a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal

mattmacy · 2016-05-29T20:50:04Z

It looks like it may be trying to use clflushopt on an unsupported processor. Can you try the latest and see if it works? If not I'll just use clflush.
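A minimal userspace sketch of the feature gate this implies: use CLFLUSHOPT only when CPUID advertises it (leaf 7, subleaf 0, EBX bit 23) and otherwise fall back to CLFLUSH. It assumes GCC or Clang's `<cpuid.h>`; the enum and function names are hypothetical. Inside a FreeBSD kernel the equivalent test would more likely consult `cpu_stdext_feature & CPUID_STDEXT_CLFLUSHOPT` than execute CPUID directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=07H,ECX=0):EBX bit 23 advertises CLFLUSHOPT. */
#define CPUID7_EBX_CLFLUSHOPT (1u << 23)

/* Hypothetical names for the two cache-line flush strategies. */
enum flush_insn { FLUSH_CLFLUSH, FLUSH_CLFLUSHOPT };

/*
 * Pure decision helper, separate from CPUID so it can be tested with
 * synthetic register values. Leaf 7 is only meaningful when the CPU
 * reports a maximum basic leaf of at least 7.
 */
static enum flush_insn pick_flush_insn(uint32_t max_basic_leaf, uint32_t leaf7_ebx)
{
    if (max_basic_leaf >= 7 && (leaf7_ebx & CPUID7_EBX_CLFLUSHOPT) != 0)
        return FLUSH_CLFLUSHOPT;
    /* Assumes CLFLUSH itself is present (CPUID.01H:EDX bit 19);
     * a fuller check would test that bit as well. */
    return FLUSH_CLFLUSH;
}

/* Query the running CPU; non-x86 builds always report the fallback. */
static enum flush_insn detect_flush_insn(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    unsigned int max_leaf = __get_cpuid_max(0, NULL);

    if (max_leaf < 7)
        return FLUSH_CLFLUSH;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;
    return pick_flush_insn(max_leaf, ebx);
#else
    return FLUSH_CLFLUSH;
#endif
}
```

Keeping the decision in a pure function makes the "unsupported processor" case easy to exercise without owning such a CPU.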

Thanks.

johalun · 2016-05-30T00:33:13Z

Getting closer :)

mattmacy · 2016-05-30T00:35:18Z

Hardly shippable, but this is definitely progress.

johalun · 2016-05-30T00:57:21Z

It actually renders something when I start X, but the screen content is all messed up, like tiles repeating in X and Y.

mattmacy · 2016-05-30T01:19:22Z

What does the log show if you set dev.drm.drm_debug=-1 (after loading i915kms) before starting X?
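Why `-1` is a convenient value: DRM debug output is gated by a bitmask of categories, and storing -1 into an unsigned mask sets every bit, so messages such as `[drm:i915_reset_and_wakeup] resetting chip` (driver category) become visible. The category values below are assumed to match upstream Linux 4.6 `drmP.h` and may differ in this port; both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Debug categories, assumed to match upstream Linux 4.6 drmP.h. */
#define DRM_UT_CORE   0x01u
#define DRM_UT_DRIVER 0x02u
#define DRM_UT_KMS    0x04u
#define DRM_UT_PRIME  0x08u
#define DRM_UT_ATOMIC 0x10u
#define DRM_UT_VBL    0x20u

/* Hypothetical helper: is a given category enabled in the mask? */
static bool drm_debug_enabled(unsigned int mask, unsigned int category)
{
    return (mask & category) != 0;
}

/* Convert a signed value written via sysctl into the unsigned mask. */
static unsigned int drm_debug_from_sysctl(int value)
{
    /* -1 converts to UINT_MAX, i.e. every category bit set. */
    return (unsigned int)value;
}
```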

johalun · 2016-05-30T01:34:17Z

Here's the log, covering boot, kldload i915kms, and starting/stopping X a couple of times.

messages.txt.zip

mattmacy · 2016-05-30T01:43:25Z

This is returning EIO. I'll have to dig into which path is doing that.

/* Throttle our rendering by waiting until the ring has completed our requests
 * emitted over 20 msec ago.
 *
 * Note that if we were to use the current jiffies each time around the loop,
 * we wouldn't escape the function with any frames outstanding if the time to
 * render a frame was over 20ms.
 *
 * This should get us reasonable parallelism between CPU and GPU but also
 * relatively low latency when blocking on a particular request to finish.
 */
static int
i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
{
    struct drm_i915_private *dev_priv = dev->dev_private;
    struct drm_i915_file_private *file_priv = file->driver_priv;
    unsigned long recent_enough = jiffies - DRM_I915_THROTTLE_JIFFIES;
    struct drm_i915_gem_request *request, *target = NULL;
    unsigned reset_counter;
    int ret;

    ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
    if (ret)
        return ret;

    ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
    if (ret)
        return ret;

    spin_lock(&file_priv->mm.lock);
    list_for_each_entry(request, &file_priv->mm.request_list, client_list) {
        if (time_after_eq(request->emitted_jiffies, recent_enough))
            break;

        /*
         * Note that the request might not have been submitted yet.
         * In which case emitted_jiffies will be zero.
         */
        if (!request->emitted_jiffies)
            continue;

        target = request;
    }
    reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
    if (target)
        i915_gem_request_reference(target);
    spin_unlock(&file_priv->mm.lock);

    if (target == NULL)
        return 0;

    ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
    if (ret == 0)
        queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);

    i915_gem_request_unreference__unlocked(target);

    return ret;
}
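The selection step in `i915_gem_ring_throttle()` can be modelled in userspace to see which request becomes `target`: the newest request emitted more than `DRM_I915_THROTTLE_JIFFIES` ago, skipping requests that were never submitted. This sketch uses an array instead of the per-file request list and assumes a 32-bit tick counter; the wraparound-safe comparison follows the `time_after_eq()` idiom, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe "a is at or after b" for a 32-bit tick counter. */
static int ticks_after_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/*
 * Return the index of the request to wait on, or -1 if none.
 * emitted[] is in submission order (oldest first); 0 means the
 * request has not been submitted yet, as in the kernel code.
 */
static int pick_throttle_target(const uint32_t *emitted, size_t n,
                                uint32_t now, uint32_t throttle_ticks)
{
    uint32_t recent_enough = now - throttle_ticks;
    int target = -1;

    for (size_t i = 0; i < n; i++) {
        if (ticks_after_eq(emitted[i], recent_enough))
            break;          /* this and all later requests are too recent */
        if (emitted[i] == 0)
            continue;       /* not submitted yet */
        target = (int)i;    /* newest request that is old enough so far */
    }
    return target;
}
```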

johalun · 2016-05-30T02:11:28Z

I will add some printfs and test.

johalun · 2016-05-30T02:37:36Z

OK. I get many of the returns at:
ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);

mattmacy · 2016-05-30T02:39:31Z

Which one though?

int
i915_gem_check_wedge(struct i915_gpu_error *error,
             bool interruptible)
{
    if (i915_reset_in_progress(error)) {
        /* Non-interruptible callers can't handle -EAGAIN, hence return
         * -EIO unconditionally for these. */
        if (!interruptible)
            return -EIO;

        /* Recovery complete, but the reset failed ... */
        if (i915_terminally_wedged(error))
            return -EIO;

        /*
         * Check if GPU Reset is in progress - we need intel_ring_begin
         * to work properly to reinit the hw state while the gpu is
         * still marked as reset-in-progress. Handle this with a flag.
         */
        if (!error->reload_in_reset)
            return -EAGAIN;
    }

    return 0;
}
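To narrow down "which one", the decision in `i915_gem_check_wedge()` can be reduced to a pure function of four flags. In this sketch (hypothetical names; plain booleans stand in for the real `reset_counter` tests) a caller passing `interruptible = false`, as `i915_gem_ring_throttle()` does, gets `-EIO` for any pending reset, wedged or not, so the return value alone cannot distinguish the two `-EIO` branches. Upstream, a wedged device also counts as "reset in progress"; the model takes the flags independently for clarity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Which branch of the check produced the result (hypothetical labels). */
enum wedge_path {
    WEDGE_OK,            /* no reset pending */
    WEDGE_EIO_NONINTR,   /* reset pending, caller cannot handle -EAGAIN */
    WEDGE_EIO_TERMINAL,  /* reset pending and recovery already failed */
    WEDGE_EAGAIN,        /* reset pending, caller should retry */
    WEDGE_OK_RELOAD      /* reset pending, but we are the reset path itself */
};

static int check_wedge_model(bool reset_in_progress, bool terminally_wedged,
                             bool reload_in_reset, bool interruptible,
                             enum wedge_path *path)
{
    if (reset_in_progress) {
        if (!interruptible) {
            *path = WEDGE_EIO_NONINTR;
            return -EIO;
        }
        if (terminally_wedged) {
            *path = WEDGE_EIO_TERMINAL;
            return -EIO;
        }
        if (!reload_in_reset) {
            *path = WEDGE_EAGAIN;
            return -EAGAIN;
        }
        *path = WEDGE_OK_RELOAD;
        return 0;
    }
    *path = WEDGE_OK;
    return 0;
}
```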

mattmacy · 2016-05-30T02:41:10Z

/**
 * i915_reset_and_wakeup - do process context error handling work
 * @dev: drm device
 *
 * Fire an error uevent so userspace can see that a hang or error
 * was detected.
 */
static void i915_reset_and_wakeup(struct drm_device *dev)
{
    struct drm_i915_private *dev_priv = to_i915(dev);
    struct i915_gpu_error *error = &dev_priv->gpu_error;
    char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
    char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
    char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
    int ret;

    kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);

    /*
     * Note that there's only one work item which does gpu resets, so we
     * need not worry about concurrent gpu resets potentially incrementing
     * error->reset_counter twice. We only need to take care of another
     * racing irq/hangcheck declaring the gpu dead for a second time. A
     * quick check for that is good enough: schedule_work ensures the
     * correct ordering between hang detection and this work item, and since
     * the reset in-progress bit is only ever set by code outside of this
     * work we don't need to worry about any other races.
     */
    if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
        DRM_DEBUG_DRIVER("resetting chip\n");
        kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
                   reset_event);

        /*
         * In most cases it's guaranteed that we get here with an RPM
         * reference held, for example because there is a pending GPU
         * request that won't finish until the reset is done. This
         * isn't the case at least when we get here by doing a
         * simulated reset via debugs, so get an RPM reference.
         */
        intel_runtime_pm_get(dev_priv);

        intel_prepare_reset(dev);

        /*
         * All state reset _must_ be completed before we update the
         * reset counter, for otherwise waiters might miss the reset
         * pending state and not properly drop locks, resulting in
         * deadlocks with the reset work.
         */
        ret = i915_reset(dev); /* <---- THIS IS RETURNING NON-ZERO */

        intel_finish_reset(dev);

        intel_runtime_pm_put(dev_priv);

        if (ret == 0) {
            /*
             * After all the gem state is reset, increment the reset
             * counter and wake up everyone waiting for the reset to
             * complete.
             *
             * Since unlock operations are a one-sided barrier only,
             * we need to insert a barrier here to order any seqno
             * updates before the counter increment.
             */
            smp_mb__before_atomic();
            atomic_inc(&dev_priv->gpu_error.reset_counter);

            kobject_uevent_env(&dev->primary->kdev->kobj,
                       KOBJ_CHANGE, reset_done_event);
        } else {
            atomic_or(I915_WEDGED, &error->reset_counter); /* <---- WHICH GETS US HERE */
        }

        /*
         * Note: The wake_up also serves as a memory barrier so that
         * waiters see the update value of the reset counter atomic_t.
         */
        i915_error_wake_up(dev_priv, true);
    }
}
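Why one failed `i915_reset()` makes every later `i915_gem_check_wedge()` fail: `reset_counter` encodes state in its bits. This sketch assumes the upstream Linux 4.6 layout (bit 0 = reset in progress, bit 31 = `I915_WEDGED`) and uses C11 atomics in place of the kernel's `atomic_t`; the helpers mirror the kernel's names but are reimplemented here for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed to match upstream Linux 4.6 i915_drv.h. */
#define I915_RESET_IN_PROGRESS_FLAG 1u
#define I915_WEDGED                 (1u << 31)

static bool reset_in_progress(atomic_uint *counter)
{
    return (atomic_load(counter) &
            (I915_RESET_IN_PROGRESS_FLAG | I915_WEDGED)) != 0;
}

static bool terminally_wedged(atomic_uint *counter)
{
    return (atomic_load(counter) & I915_WEDGED) != 0;
}

/* Hang detected: mark a reset as pending (the counter becomes odd). */
static void declare_hang(atomic_uint *counter)
{
    atomic_fetch_or(counter, I915_RESET_IN_PROGRESS_FLAG);
}

/* Outcome of the reset work, mirroring the branch in i915_reset_and_wakeup(). */
static void finish_reset(atomic_uint *counter, int reset_ret)
{
    if (reset_ret == 0)
        atomic_fetch_add(counter, 1);           /* odd -> even: reset done */
    else
        atomic_fetch_or(counter, I915_WEDGED);  /* in-progress bit never clears */
}
```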

mattmacy · 2016-05-30T02:43:51Z

/**
 * i915_reset - reset chip after a hang
 * @dev: drm device to reset
 *
 * Reset the chip.  Useful if a hang is detected. Returns zero on successful
 * reset or otherwise an error code.
 *
 * Procedure is fairly simple:
 *   - reset the chip using the reset reg
 *   - re-init context state
 *   - re-init hardware status page
 *   - re-init ring buffer
 *   - re-init interrupt state
 *   - re-init display
 */
int i915_reset(struct drm_device *dev)
{
    struct drm_i915_private *dev_priv = dev->dev_private;
    bool simulated;
    int ret;

    intel_reset_gt_powersave(dev);

    mutex_lock(&dev->struct_mutex);

    i915_gem_reset(dev);

    simulated = dev_priv->gpu_error.stop_rings != 0;

    ret = intel_gpu_reset(dev); /* <---- EITHER THIS FAILS */

    /* Also reset the gpu hangman. */
    if (simulated) {
        DRM_INFO("Simulated gpu hang, resetting stop_rings\n");
        dev_priv->gpu_error.stop_rings = 0;
        if (ret == -ENODEV) {
            DRM_INFO("Reset not implemented, but ignoring "
                 "error for simulated gpu hangs\n");
            ret = 0;
        }
    }

    if (i915_stop_ring_allow_warn(dev_priv))
        pr_notice("drm/i915: Resetting chip after gpu hang\n");

    if (ret) {
        DRM_ERROR("Failed to reset chip: %i\n", ret);
        mutex_unlock(&dev->struct_mutex);
        return ret;
    }

    intel_overlay_reset(dev_priv);

    /* Ok, now get things going again... */

    /*
     * Everything depends on having the GTT running, so we need to start
     * there.  Fortunately we don't need to do this unless we reset the
     * chip at a PCI level.
     *
     * Next we need to restore the context, but we don't use those
     * yet either...
     *
     * Ring buffer needs to be re-initialized in the KMS case, or if X
     * was running at the time of the reset (i.e. we weren't VT
     * switched away).
     */

    /* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
    dev_priv->gpu_error.reload_in_reset = true;

    ret = i915_gem_init_hw(dev); /* <---- OR THIS FAILS */

    dev_priv->gpu_error.reload_in_reset = false;

    mutex_unlock(&dev->struct_mutex);
    if (ret) {
        DRM_ERROR("Failed hw init on reset %d\n", ret);
        return ret;
    }

mattmacy · 2016-05-30T02:48:37Z

Maybe try and instrument this to make sure we're actually waiting for 500ms here?

static int gen6_do_reset(struct drm_device *dev)
{
    struct drm_i915_private *dev_priv = dev->dev_private;
    int ret;

    /* Reset the chip */

    /* GEN6_GDRST is not in the gt power well, no need to check
     * for fifo space for the write or forcewake the chip for
     * the read
     */
    __raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_FULL);

    /* Spin waiting for the device to ack the reset request */
    ret = wait_for((__raw_i915_read32(dev_priv, GEN6_GDRST) & GEN6_GRDOM_FULL) == 0, 500); /* <----- THIS MAY BE DODGY */

    intel_uncore_forcewake_reset(dev, true);

    return ret;
}
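One way to check that the 500 ms timeout is honoured is to compare against a userspace model of a `wait_for()`-style poll. This sketch assumes the macro's contract is "poll COND until true or the timeout expires, re-check once, and return -ETIMEDOUT on expiry" (as in upstream Linux); the register is simulated by a callback, time comes from POSIX `clock_gettime(CLOCK_MONOTONIC)`, and all names are hypothetical. Unlike the kernel macro, it never sleeps between polls.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef bool (*cond_fn)(void *ctx);

static int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Poll cond(ctx) until true or timeout_ms elapses. Returns 0 or
 * -ETIMEDOUT; *elapsed_ms is the actual wait, the number to instrument. */
static int poll_with_timeout(cond_fn cond, void *ctx, int64_t timeout_ms,
                             int64_t *elapsed_ms)
{
    int64_t start = now_ms();
    int64_t deadline = start + timeout_ms;
    int ret = 0;

    while (!cond(ctx)) {
        if (now_ms() > deadline) {
            /* Re-check once so a late success is not reported as a timeout. */
            if (!cond(ctx))
                ret = -ETIMEDOUT;
            break;
        }
    }
    *elapsed_ms = now_ms() - start;
    return ret;
}

/* Simulated GDRST-like register whose reset bit clears after N reads. */
struct fake_reg {
    int reads_until_clear;
};

static bool fake_reset_acked(void *ctx)
{
    struct fake_reg *r = ctx;
    if (r->reads_until_clear > 0) {
        r->reads_until_clear--;
        return false;
    }
    return true;
}
```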

johalun · 2016-05-30T02:59:01Z

I get "reset in progress". What API can I use to time a function in a kernel driver?

mattmacy · 2016-05-30T03:04:31Z

from sys/time.h:

/*
 * Functions for looking at our clock: [get]{bin,nano,micro}[up]time()
 *
 * Functions without the "get" prefix returns the best timestamp
 * we can produce in the given format.
 *
 * "bin"   == struct bintime  == seconds + 64 bit fraction of seconds.
 * "nano"  == struct timespec == seconds + nanoseconds.
 * "micro" == struct timeval  == seconds + microseconds.
 *
 * Functions containing "up" returns time relative to boot and
 * should be used for calculating time intervals.
 *
 * Functions without "up" returns UTC time.
 *
 * Functions with the "get" prefix returns a less precise result
 * much faster than the functions without "get" prefix and should
 * be used where a precision of 1/hz seconds is acceptable or where
 * performance is priority. (NB: "precision", _not_ "resolution" !)
 */

void    binuptime(struct bintime *bt);
void    nanouptime(struct timespec *tsp);
void    microuptime(struct timeval *tvp);

This is purely for instrumentation so I think the added overhead of the extra resolution is ok.
"microuptime" sounds like the way to go.
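A minimal sketch of the instrumentation pattern: take a timestamp before and after the call and subtract. In the kernel that would be `microuptime()` on both sides (FreeBSD's `timevalsub()` could do the subtraction); since those are kernel-only, this userspace version fills a `struct timeval` from `clock_gettime(CLOCK_MONOTONIC)`. `uptime_tv()` and `tv_elapsed_us()` are hypothetical helpers.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Userspace stand-in for microuptime(): monotonic time as a timeval. */
static void uptime_tv(struct timeval *tv)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    tv->tv_sec = ts.tv_sec;
    tv->tv_usec = ts.tv_nsec / 1000;
}

/* end - start in microseconds. Seconds and microseconds are folded into
 * one count, so no separate borrow step is needed. */
static int64_t tv_elapsed_us(const struct timeval *start, const struct timeval *end)
{
    int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
    int64_t usec = (int64_t)end->tv_usec - (int64_t)start->tv_usec;
    return sec * 1000000 + usec;
}

/*
 * In-kernel usage would look roughly like (not compiled here):
 *   struct timeval t0, t1;
 *   microuptime(&t0);
 *   ret = wait_for(..., 500);
 *   microuptime(&t1);
 *   printf("wait_for took %jd us\n", (intmax_t)tv_elapsed_us(&t0, &t1));
 */
```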

johalun · 2016-05-30T03:53:57Z

I don't get any output at all from gen6_do_reset()...

johalun · 2016-05-30T03:58:06Z

Getting -5 from intel_gpu_reset(). By the way, isn't Cherryview gen8? I have to stop now, but I can keep digging tomorrow.

mattmacy · 2016-05-30T03:58:18Z

Try sticking a BACKTRACE() in each of the reset functions to see which is getting called.

mattmacy · 2016-05-30T03:58:32Z

gen6_do_reset is called from gen8_do_reset
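Read together with the traces, the gen8 path appears to request a reset on every engine, wait for each to report ready-to-reset, and only then call the full `gen6_do_reset()`; if any engine times out, the requests are withdrawn and `-EIO` (-5) is returned. That would explain a -5 from `intel_gpu_reset()` with no output from `gen6_do_reset()`. The sketch below models that control flow with simulated engines; its structure and names are assumptions based on upstream Linux 4.6, not the port's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated engine: does it acknowledge a reset request in time? */
struct sim_engine {
    const char *name;
    bool acks_reset_request;  /* would report READY_TO_RESET before the timeout */
    bool reset_requested;     /* models the request bit the driver writes */
};

/* Hypothetical stand-in for the full-chip reset (gen6_do_reset). */
static int full_chip_reset(int *calls)
{
    (*calls)++;
    return 0;
}

/* Request every engine, then reset the chip; on any timeout undo and fail. */
static int gen8_reset_model(struct sim_engine *engines, size_t n,
                            int *full_reset_calls, size_t *failed_idx)
{
    for (size_t i = 0; i < n; i++) {
        engines[i].reset_requested = true;
        if (!engines[i].acks_reset_request) {
            *failed_idx = i;  /* "<name> ring: reset request timeout" */
            goto not_ready;
        }
    }
    return full_chip_reset(full_reset_calls);

not_ready:
    for (size_t i = 0; i < n; i++)
        engines[i].reset_requested = false;  /* withdraw every request */
    return -EIO;
}
```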

mattmacy · 2016-05-30T04:09:05Z

I'm on #freebsd-xorg on EFnet much of the time. Easier to discuss in real-time.

johalun · 2016-05-30T16:59:48Z

The reset request times out, so gen6_do_reset never gets called.

[drm:cherryview_enable_rps] setting GPU freq to 400 MHz (40)
[drm] stuck on render ring
[drm] stuck on blitter ring
[drm] stuck on bsd ring
[drm] stuck on video enhancement ring
[drm] GPU HANG: ecode 8:0:0x00201001, reason: Ring hung, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
[drm:i915_reset_and_wakeup] resetting chip
[drm:0xffffffff82a2b4bds] *ERROR* gpu hanging too fast, banning!
intel_gpu_reset() at intel_gpu_reset+0x1e/frame 0xfffffe004996c790
i915_reset() at i915_reset+0x6d/frame 0xfffffe004996c800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe004996c890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe004996c990
i915_hangcheck_elapsed() at i915_hangcheck_elapsed+0x49f/frame 0xfffffe004996ca60
gen8_do_reset() at gen8_do_reset+0x1e/frame 0xfffffe004996c740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe004996c790
i915_reset() at i915_reset+0x6d/frame 0xfffffe004996c800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe004996c890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe004996c990
[drm:0xffffffff82ba00ccs] *ERROR* blitter ring: reset request timeout
i915_reset: intel_gpu_reset returns -5
drm/i915: Resetting chip after gpu hang
[drm:0xffffffff82a1511ds] *ERROR* Failed to reset chip: -5

mattmacy · 2016-05-30T17:02:59Z

Where though?

johalun · 2016-05-30T17:27:51Z

Not sure what this means, but the reset fails at the second ring.

[drm:i915_reset_and_wakeup] resetting chip
[drm:0xffffffff82a2b4bds] *ERROR* gpu hanging too fast, banning!
intel_gpu_reset() at intel_gpu_reset+0x1e/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe0049976990
i915_hangcheck_elapsed() at i915_hangcheck_elapsed+0x49f/frame 0xfffffe0049976a60
gen8_do_reset() at gen8_do_reset+0x21/frame 0xfffffe0049976740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
i915_handle_error() at i915_handle_error+0x11e/frame 0xfffffe0049976990
gen8_do_reset: for each ring i=0. dev=0xfffffe00016b2000. engine=0xfffffe00016b38d8.
wait_for_register() at wait_for_register+0x23/frame 0xfffffe00499766b0
gen8_do_reset() at gen8_do_reset+0x132/frame 0xfffffe0049976740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
gen8_do_reset: for each ring i=1. dev=0xfffffe00016b2000. engine=0xfffffe00016b4b78.
wait_for_register() at wait_for_register+0x23/frame 0xfffffe00499766b0
gen8_do_reset() at gen8_do_reset+0x132/frame 0xfffffe0049976740
intel_gpu_reset() at intel_gpu_reset+0x7a/frame 0xfffffe0049976790
i915_reset() at i915_reset+0x6d/frame 0xfffffe0049976800
i915_reset_and_wakeup() at i915_reset_and_wakeup+0x162/frame 0xfffffe0049976890
[drm:0xffffffff82ba0104s] *ERROR* blitter ring: reset request timeout
gen8_do_reset: for each ring do reset (not ready) i=0. dev=0xfffffe00016b2000. engine=0xfffffe00016b38d8.
gen8_do_reset: for each ring do reset (not ready) i=1. dev=0xfffffe00016b2000. engine=0xfffffe00016b4b78.
gen8_do_reset: for each ring do reset (not ready) i=2. dev=0xfffffe00016b2000. engine=0xfffffe00016b5e18.
gen8_do_reset: for each ring do reset (not ready) i=4. dev=0xfffffe00016b2000. engine=0xfffffe00016b8358.
i915_reset: intel_gpu_reset returns -5
drm/i915: Resetting chip after gpu hang
[drm:0xffffffff82a1511ds] *ERROR* Failed to reset chip: -5
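The loop indices in this trace (0, 1, 2, 4) are consistent with `for_each_ring()` walking a bitmask of the engines the device actually has. Assuming the upstream Linux 4.6 numbering (RCS=0, BCS=1, VCS=2, VCS2=3, VECS=4), index 3 would be the second video engine, which Cherryview lacks, and index 1, the second ring, is the blitter that reports the timeout. A minimal sketch of that iteration, with an illustrative mask:

```c
#include <assert.h>
#include <stddef.h>

/* Engine indices, assumed to follow upstream Linux 4.6 enum intel_ring_id. */
enum ring_id { RCS = 0, BCS, VCS, VCS2, VECS, NUM_RINGS };

/* Collect the indices a mask-driven for_each_ring() would visit. */
static size_t rings_in_mask(unsigned int ring_mask, int out[NUM_RINGS])
{
    size_t n = 0;
    for (int i = 0; i < NUM_RINGS; i++) {
        if (ring_mask & (1u << i))
            out[n++] = i;   /* absent engines are skipped entirely */
    }
    return n;
}
```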

johalun · 2016-05-30T17:28:16Z

I'm on IRC btw.

mattmacy · 2016-05-30T17:33:28Z

What's your nick? I'm on #freebsd-xorg on EFnet.

mattmacy · 2016-06-04T23:12:01Z

Works for me!

mattmacy · 2016-06-04T23:13:00Z

Please file a separate issue for any further problems.


mattmacy self-assigned this May 28, 2016

mattmacy changed the title from "Kernel panic on Cherry Trail when kldload i915kms" to "GPU hang on Cherry Trail" May 30, 2016

mattmacy closed this as completed Jun 4, 2016

nomadlogic mentioned this issue Jun 6, 2016

6d310b5(drm-next-4.6) & xf86-video-intel-2.99.917.20160417 panic #18

Closed

Krisnda mentioned this issue Jul 10, 2016

drm 4.6 radeon panic on gnome-shell load #48

Closed

mattmacy mentioned this issue Jul 12, 2016

linux chromium dependencies #50

Closed

nomadlogic mentioned this issue Aug 11, 2016

Kernel Panic when using modesetting driver on 762b75d(drm-next-4.6) #60

Closed

mjoras pushed a commit to mjoras/freebsd-base-graphics that referenced this issue Jun 16, 2017

Merge pull request FreeBSDDesktop#7 from skarekrow/patch-1

cea905a

Update mountd

gldisater mentioned this issue Jul 14, 2017

RX480 regression #158

Open
