Hacking the Kinect

As our Product and Design teams at Polar barrel towards the MediaEverywhere launch, we took some time to step back from our immediate roadmap needs and look into the not-so-distant future.

While mobile growth is exploding and will likely continue to do so for the foreseeable future, we would be remiss in thinking that touch-based interaction is the final act in the rapidly changing world of human-computer user experience (UX). With the Xbox Kinect, and rumours of an Apple television looming, it’s hard not to think about the impending shift in user interaction design. Samsung, LG, Sony et al. are all producing web-enabled televisions armed with the same full web stack which we have begun building on, so the opportunity to extend our UX design is obvious to us.

This opportunity is not without its challenges, though: control and distance are two of the most critical.

Touch-based devices have enabled unparalleled control of the UI, and we can now flick, pinch and slide our way around an application’s interface. This brings the user much closer to the content they are experiencing, and adds a level of intuition that is unprecedented in the field of human-computer interaction (HCI).

Rotating and scaling a photo is intuitive, as is touching it to make edits and seeing the image respond with the changes. However, as we introduce the “10-foot experience”, where a user is at a greater distance from the content they want to interact with, the touch/mouse/keyboard interaction model falls flat. Furthermore, there is rarely any physical interaction with the system, which adds further challenges. We can no longer click, type, pinch, flick or slide our content.

The complexity of the applications we use today has made the physical remote a burden in all but the simplest cases. (The Apple TV has done an excellent job of simplifying the UX so that a basic remote can be used, but tasks such as inputting text are still cumbersome and frustrating under this simplified interaction model.) Further, when you consider the viewing distance of a television or larger screen, it becomes apparent that the UI itself must change. The small text and touch targets used in most touch interfaces will no longer be usable on a screen that is further away. As Fitts’ Law shows, the time needed to hit a target grows as the target gets smaller or further away, so the UI must be scaled up to suit these larger screens and less precise inputs.
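For reference, a common statement of Fitts’ Law (the Shannon formulation) predicts movement time as MT = a + b * log2(D / W + 1), where D is the distance to the target, W is the target's width, and a and b are constants that have to be measured for a given input device. The sketch below computes the log term, usually called the index of difficulty, for two target sizes. The pixel values and the default constants are made up for illustration.

```javascript
// Index of difficulty (in bits) under the Shannon formulation of Fitts' Law.
// distance and width must be in the same unit (pixels, centimetres, ...).
function indexOfDifficulty(distance, width) {
  return Math.log2(distance / width + 1);
}

// Predicted movement time MT = a + b * ID. The constants a and b are
// device-specific; the defaults here are placeholders, not measurements.
function movementTime(distance, width, a = 0, b = 1) {
  return a + b * indexOfDifficulty(distance, width);
}

// Same 700px reach, two target sizes (illustrative numbers only).
const smallTarget = indexOfDifficulty(700, 44);  // a touch-sized button
const largeTarget = indexOfDifficulty(700, 100); // the same button enlarged
```

The larger target has the lower index of difficulty, which is the argument for bigger targets when the input is a hand waving in the air rather than a fingertip on glass.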

As such, we are left with two problems when adapting a touch/click UX for interaction at a distance: 1) the existing layout conventions and 2) the existing interaction models both lack usability and require updates.

There are a couple of emerging ideas to solve the interaction limitations: gestural input and voice input. Microsoft introduced the world to the Kinect and with it unlocked the potential for affordable, full-body gestural interaction in your family room. While a remarkable piece of technology, the Kinect is not a panacea for the interaction problem with the television; there are many issues that we will highlight later on in this post.

Hack Day: Kinect meets MediaEverywhere

When we started brainstorming ideas for our last hack day, we had a bunch of really interesting problems, but by the end we were most excited about Depth.js, a JavaScript bridge that someone had built to interface with the Kinect through USB. There has not been much work done with gestural UX in the browser, and so with our idea in hand we were off to the races.


Depth.js uses OpenNI/NITE in the backend to collect the raw data from the Kinect and then exposes that data to the browser through an extension. Through OpenNI/NITE, Depth.js includes gesture recognition for many basic interactions out of the box, and we were easily able to capture events for swiping, pushing, waving, etc.

There are many resources online to help you get started with building Depth.js, and we went with the Safari extension to build our MediaEverywhere TV prototype.


One of our first problems to solve was that of the UI. How would we provide feedback to the user about their current context and status of their interactions? While not ground-breaking per se, we decided to overlay a hand graphic on top of our existing MediaEverywhere UI layer. This graphic would track 1:1 with the user’s hand and change colors to indicate context, providing the needed feedback on how further actions (such as a push or swipe) would behave.

Depth.js by default tracks the first point it recognizes as a hand, which took care of hand tracking for us. It then streams spatial data (x, y, z) of that point to the browser in a ‘move’ event. By mapping the x and y data from the Kinect bridge to the spatial coordinates of our UI, we were able to determine the location of the user’s hand in relation to our UI.
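As a rough illustration of that mapping (a sketch, not the hack day code), the function below linearly rescales a hand position onto viewport pixels and clamps it to the screen. The 0 to 100 input range and the shape of the ‘move’ payload are assumptions made for the example; check what your own bridge actually reports.

```javascript
// Assumed input range for the bridge's x and y values. This is a placeholder:
// measure what your own setup reports and adjust.
const SENSOR_RANGE = { min: 0, max: 100 };

function clamp(value, lo, hi) {
  return Math.min(hi, Math.max(lo, value));
}

// Linearly map one sensor reading onto a pixel axis of the given length.
function toPixels(value, length, range = SENSOR_RANGE) {
  const t = (value - range.min) / (range.max - range.min);
  return clamp(t, 0, 1) * length;
}

// Convert a hypothetical 'move' payload { x, y, z } into viewport coordinates.
function mapHandToViewport(move, viewport) {
  return {
    x: toPixels(move.x, viewport.width),
    y: toPixels(move.y, viewport.height),
  };
}
```

In a browser, the viewport argument could be built from window.innerWidth and window.innerHeight. Clamping keeps the hand graphic on screen when the user reaches past the edge of the tracked area.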

To render the graphic, we overlaid an HTML5 canvas element over our entire MediaEverywhere UI layer. We ran a basic draw loop using setTimeout to run at 60fps (in lieu of requestAnimationFrame, which isn’t available in Safari), which continuously rendered the hand graphic in various colors depending on the state of the interaction (green for ‘ready’, red for ‘swipe recognized’, yellow for ‘waiting’, etc.). When we detected gestures such as a push or swipe, we relied on document.elementFromPoint(x, y) to find the element the user intended to interact with, and called the appropriate methods from there. For example, when we detected a ‘push’ on an article, we took the user into the article. A ‘swipe’ or ‘wave’ took the user back a level in the navigation.
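The pieces described above might look something like the following. This is an illustrative sketch rather than our hack day code: the hand object shape, the state names, the circle standing in for the hand graphic, and the handlers map are all assumptions. The canvas calls and the standard DOM hit test document.elementFromPoint are real browser APIs; the document is passed in as an argument so the functions can be exercised outside a browser.

```javascript
// Colour per interaction state, following the scheme described above.
// The state keys are hypothetical names chosen for this sketch.
const STATE_COLORS = { ready: 'green', swipe: 'red', waiting: 'yellow' };

// Draw the hand marker as a filled circle. ctx is a CanvasRenderingContext2D;
// hand is an assumed shape { x, y, state } in canvas pixels.
function drawHand(ctx, hand, width, height) {
  ctx.clearRect(0, 0, width, height);
  ctx.fillStyle = STATE_COLORS[hand.state] || 'gray';
  ctx.beginPath();
  ctx.arc(hand.x, hand.y, 30, 0, Math.PI * 2);
  ctx.fill();
}

// setTimeout-based loop targeting roughly 60fps. The first frame renders
// immediately; the returned function stops the loop.
function startDrawLoop(render, fps = 60) {
  let timer = null;
  let running = true;
  function tick() {
    if (!running) return;
    render();
    timer = setTimeout(tick, 1000 / fps);
  }
  tick();
  return function stop() {
    running = false;
    clearTimeout(timer);
  };
}

// Route a recognised gesture to the element under the hand.
// doc is the page's document; handlers maps gesture names to callbacks.
function dispatchGesture(doc, gesture, point, handlers) {
  const target = doc.elementFromPoint(point.x, point.y);
  const handler = handlers[gesture];
  if (target && handler) handler(target);
  return target;
}
```

In a page this would be wired up with something like startDrawLoop(() => drawHand(ctx, hand, canvas.width, canvas.height)), where ctx comes from canvas.getContext('2d'). One thing to watch for: a full-screen overlay canvas sits on top of everything, so it will likely need the CSS rule pointer-events: none for the hit test to see through it to the UI underneath.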

As a nice touch of detail, we subtly scaled the hand graphic based on the z-depth information we received about the user’s hand, which made gestures such as a ‘push’ more responsive and natural.
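One simple way to get that effect is a clamped linear interpolation from depth to scale, sketched below. The depth range, the scale limits, and the choice that a nearer hand draws larger are all assumptions for the example; tune them against whatever z values your bridge reports.

```javascript
// Assumed depth range and scale limits; all four numbers are placeholders.
const DEPTH = { near: 0, far: 100 };
const SCALE = { min: 0.6, max: 1.4 };

// Map a z reading to a scale factor for the hand graphic.
function scaleForDepth(z, depth = DEPTH, scale = SCALE) {
  const raw = (z - depth.near) / (depth.far - depth.near);
  const t = Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
  // Nearer hand (smaller z) draws larger, so interpolate from max down to min.
  return scale.max - t * (scale.max - scale.min);
}
```

The resulting factor can be multiplied into the radius or image size used when drawing the hand, so the graphic grows or shrinks as the user moves their hand towards or away from the sensor.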

The Issues

While we developed a working demo in a little under 9 hours, some obvious issues arose which would need to be solved before this technology can be used in production.

First, because the Kinect was not designed to interface with JavaScript in the browser and our solution involves many moving parts working in perfect unison, there were occasional lags and disconnections which would frustrate the average user.

Second, with any new technology there is a period of adoption where interactions must be learned anew. Just as users were forced to relearn some interaction paradigms during the shift from mouse-driven to touch-based UIs, the same learning period will exist for gestural experiences.

Cueing the user as to what is possible and what is expected is no easy task, and would require significant education.

The Future

While we believe that the traditional interaction models of mouse/keyboard input and, more recently, touch-based input will not disappear anytime soon, we expect the challenges presented by the “10-foot experience” of connected TVs and larger screens will demand a paradigm shift in UX and interaction design. While it is still not clear whether gestural input, voice-based commands, or some yet-to-be-discovered interaction mode will reign in the long term, we are in a very exciting period where the possibilities are endless. The Kinect is the first successful example of an affordable, widespread piece of technology which makes these new experiences possible, and it is unlikely to be the last. With expectations that the Xbox 360 will soon support Kinect for Internet Explorer, the way we design the web could change sooner than expected.

In the meantime, despite the challenges and frustrations of building and hacking the Kinect into the browser, we had an awesome time and were excited by the possibilities it opened up to us. For anyone looking to explore the realm of gesture-based interaction, Depth.js and the Kinect are a great way to get started building your own prototypes!

First published May 24 2012