Since we started kooaba back in 2007, we have always focused on running object recognition algorithms on the server, not on the client. Why? There are several reasons.
Why object recognition should be run on a server
Scale: with the same quality of recognition, a mobile device limits you to a few dozen items, perhaps a few hundred. We want to recognize millions, and that is only possible on a server. One might argue that transferring the query image to the server is slow, or that there may be no network connectivity at all. Fortunately, our algorithms work with quite strongly compressed images, which keeps transfer times short (a brief code sketch after these three points illustrates the idea). Plus, time is working for us: LTE makes transfer and latency times less of an issue, and in many use cases users are on WiFi most of the time anyway. (For instance, in surveys for Shortcut we found that most users like to use the app at home.)
Flexibility: on the server side, algorithms can be adapted, tuned, or changed at any time. New kinds of recognition can also be added on the fly, for instance face recognition or OCR, and the same image can be sent in parallel to several services.
Fragmentation: given the heterogeneity of platforms out there (iOS, Android, Windows Phone, …), you are well advised to keep the client-side code as simple as possible.
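To illustrate the point about compressed query images, here is a minimal sketch in Swift of how a camera frame could be compressed quite strongly before being uploaded as a query. The function name and the quality setting are assumptions for illustration, not our actual client code:

```swift
import UIKit

// Illustrative only: turn a camera frame into a small JPEG payload before
// uploading it as a recognition query. A strongly compressed image is usually
// still good enough for server-side matching and keeps transfer times short.
func queryPayload(from frame: UIImage, quality: CGFloat = 0.4) -> Data? {
    // quality = 0.4 is an assumed value; the right trade-off depends on the
    // recognition algorithm and on typical network conditions.
    return frame.jpegData(compressionQuality: quality)
}
```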
All that being said, there are some cases where we believe computer vision algorithms come in handy on the client and should be run there. However, they don’t do object recognition.
What vision algorithms can and should be run on the mobile client?
The Shortcut app is a good example of an app combining client-side and server-side recognition.
Recognition of newspaper and magazine pages is run on the server. Each month we add hundreds of thousands of pages to the index; the database would simply be way too large to keep on the client.
However, the latest version 3.0 of Shortcut also scans QR codes. This is done on the client. Why? QR and other code scanners benefit a lot from high-resolution images, so it is not advisable to send them to the server. Furthermore, code-scanning algorithms usually “try” a number of image frames to find a code: a code may be detected in one frame but not in another. Thus sending a rapid sequence of high-resolution frames to the code scanner gives the best results, and this should be done on the client. (The computer vision needed for reading QR codes is also simple compared to object recognition.)
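As a concrete illustration, the sketch below shows client-side QR scanning on iOS with AVFoundation, feeding every camera frame to the scanner until a code is found. The class and its structure are illustrative assumptions, not Shortcut’s actual implementation:

```swift
import AVFoundation

// Illustrative sketch: on-device QR scanning with AVFoundation. The session
// feeds full-resolution frames to the metadata output continuously, so a code
// missed in one frame can still be detected in one of the following frames.
final class QRScanner: NSObject, AVCaptureMetadataOutputObjectsDelegate {
    private let session = AVCaptureSession()

    func start() throws {
        guard let camera = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: camera)
        let output = AVCaptureMetadataOutput()
        guard session.canAddInput(input), session.canAddOutput(output) else { return }
        session.addInput(input)
        session.addOutput(output)

        output.setMetadataObjectsDelegate(self, queue: .main)
        output.metadataObjectTypes = [.qr]   // look for QR codes in every frame

        session.startRunning()
    }

    // Called whenever a code is found in the current frame.
    func metadataOutput(_ output: AVCaptureMetadataOutput,
                        didOutput metadataObjects: [AVMetadataObject],
                        from connection: AVCaptureConnection) {
        for case let code as AVMetadataMachineReadableCodeObject in metadataObjects {
            print("QR payload:", code.stringValue ?? "")
        }
    }
}
```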
But there is more: the server-side object recognition actually needs only a small number of lower-resolution images to work with. In a “live-scanning” scenario like Shortcut 3.0, users don’t actively release the shutter to start a search; instead, an automated “scanning” process on the client sends frames to the server. This process should select the images to send in a “smart” way: if nothing has changed (e.g. the user did not move the device), there is no need to send another query to the server. This is another reason for running computer vision on the client. We employ a “scene change detection” algorithm on the client, which notices when the scene changes and uses that information to trigger new queries to the server.
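A very simple version of such a detector could compare a small grayscale thumbnail of each frame against the previous one. The sketch below is an illustration under assumed parameters (the thumbnail size and threshold are made up), not our actual algorithm:

```swift
// Illustrative sketch (assumed threshold, not kooaba's actual algorithm):
// each camera frame is reduced to a small grayscale thumbnail, and the mean
// absolute difference to the previous thumbnail decides whether the scene
// has changed enough to justify sending a new query to the server.
struct SceneChangeDetector {
    private var previousThumbnail: [Float]?

    /// Mean per-pixel difference (0...255) above which the frame is treated
    /// as a new scene. The value 12 is an assumption and needs tuning.
    let changeThreshold: Float = 12

    /// `thumbnail` is a tiny grayscale version of the current frame,
    /// e.g. 16x16 luma values in the range 0...255.
    mutating func sceneChanged(thumbnail: [Float]) -> Bool {
        defer { previousThumbnail = thumbnail }
        guard let previous = previousThumbnail,
              previous.count == thumbnail.count else { return true }

        var totalDifference: Float = 0
        for (oldPixel, newPixel) in zip(previous, thumbnail) {
            totalDifference += abs(oldPixel - newPixel)
        }
        return totalDifference / Float(thumbnail.count) > changeThreshold
    }
}
```

When the detector reports a change, the client sends fresh frames to the server; when the scene stays stable and nothing is recognized, it can stop querying until the next change.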
The video below shows how this works: when either the phone moves or the scene changes without the phone moving (for instance, an item is moved in front of the camera), the client stops scanning after a while (illustrated by the “no results found” message). Of course, once an object that can be recognized enters the scene, the matching result is shown.
So, computer vision on the client is used to improve the user experience and to do simple recognition of codes; the heavy lifting of object recognition is done on the server.
I am a developer – when can I use it, too?
We have received a number of requests from developers asking how to create the same scanning experience as in Shortcut. We therefore plan to make a library or source code available to you in the near future.