That definitely sounds rather complex. So you are going to use hand gesture recognition of some sort? Given how much processing that will take, and the need for real-time performance, it might be best to build as a native app, since a web-based app would have to constantly stream video to a server for the recognition. You'll probably still want a server in the middle for connecting clients. Be sure to plan this out thoroughly; there are a few ways to go about it.
For example, Google's voice recognition seems to use a mix of both. On my Android phone, if I issue a voice command, it interprets most of it locally (since it is trained to my voice), but I think it reaches out to Google's servers if it has trouble, or to figure out what it should actually do with my command. E.g., it can recognize "text my wife" locally, but then ships that out to Google's servers, which respond with something like, "that means you should prompt for a message, then send an SMS intent to the text messaging app for xxx-xxx-xxxx". That way, Google can constantly update/tweak what commands do on their end without having to push full app updates hourly.
You could take a similar approach: do all the telemetry of measuring the hand gestures locally (xyz coordinates, paths, or however you measure that), then send a stream of that data to your server to interpret what the gesture means. The server then responds with "that was a Q", and your app handles that by displaying the text and/or a voice-over. Then you can easily update your gesture formulas and such on your server without having to rebuild the whole app.
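Just to make the split concrete, here's a rough Python sketch of that round-trip. Everything in it is hypothetical: the function names, the JSON payload shape, and especially the "mostly horizontal path means Q" rule, which is a placeholder standing in for whatever real recognition model the server would run.

```python
import json

def package_gesture(points):
    """Client side: bundle raw (x, y, z) samples into a JSON payload.
    The client only measures and ships coordinates; it does no interpretation."""
    return json.dumps({"samples": [{"x": x, "y": y, "z": z} for x, y, z in points]})

def interpret_gesture(payload):
    """Server side (stub): map the coordinate stream to a gesture label.
    The real recognition logic lives here, so gesture formulas can be
    updated on the server without rebuilding the client app."""
    samples = json.loads(payload)["samples"]
    dx = samples[-1]["x"] - samples[0]["x"]
    dy = samples[-1]["y"] - samples[0]["y"]
    # Placeholder rule, NOT a real classifier: a mostly-horizontal path is "Q".
    return {"gesture": "Q"} if abs(dx) > abs(dy) else {"gesture": "unknown"}

# Client packages the telemetry, the server replies with a label,
# and the client displays the text and/or plays the voice-over.
payload = package_gesture([(0.0, 0.0, 0.0), (0.5, 0.1, 0.0), (1.0, 0.0, 0.0)])
result = interpret_gesture(payload)
print(result["gesture"])  # client shows/speaks whatever the server decided
```

In a real app the call to `interpret_gesture` would be a network request over your server connection rather than a local function, but the division of labor is the point: measurement on the device, meaning on the server.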
Just one idea of an approach.
Good luck with that.