Retail Self-checkout Object Detection Solution using Azure Percept

Published Jan 27 2022 08:00 AM



There is a high demand for self-checkouts in grocery stores. This is because they have several advantages, such as a faster process (and hence shorter waiting lines) or, in the case of a pandemic such as COVID-19, safer interaction, as fewer people need to touch the products. Computer vision can help with these tasks by automatically detecting objects and the number of items, especially in fruit detection, where the self-checkout kiosk would already have information on the fruits collected in the basket.


To implement the Retail Self-checkout Object Detection solution using Azure Percept, we can choose between a no code approach, an approach requiring some code (low code), and the option of customizing every small detail (pure code). This flexibility allows us to work on a vast range of projects and timeframes, i.e., supercharging POCs and MVPs; the platform incorporates scalability in its core, enabling us to push the ML system to any number of edge devices.


In order to explore the capabilities of Azure Percept, a team from Cognizant enrolled in the Microsoft Azure Percept Bootcamp. Using the knowledge from the bootcamp, we developed a Retail Self-checkout Object Detection solution, outlined below, and deployed it to Azure Percept DK. The solution and the approaches we used are detailed in this article.


Overview of the Solution


We implemented the Retail Self-checkout Object Detection Solution on Azure Percept using three different approaches (No Code, Low Code and Pure Code) for the same fruit detection use case. Each successive approach requires more customization and allows for more flexibility. We have outlined each approach in detail in separate sections below.


A brief description of the solution can be found in the following YouTube video:



The overall architecture for the solution is shown below. The solution features an integrated framework with Azure Key Services: Azure Percept Studio, Azure Machine Learning Studio, Azure Custom Vision, IoT Hub and IoT Edge. 




General Pre-requisites – What is required to get started


Solution Implementation 


Retail Self-checkout Object Detection Solution: No Code Implementation 



The first approach for deploying an object detection model using Azure Percept requires no coding at all. Here, the user acquires the dataset manually with Azure Percept Vision and labels each image individually, which puts the user in direct control of what the model is trained on and of the performance metrics of each training iteration. Finally, we can manually choose the best performing model to be deployed to Azure Percept DK.

You will need to:

  1. Capture data, with the custom vision service.
  2. Label the data with the custom vision service.
  3. Train an object detection model, with the custom vision service.
  4. Publish the model and download the solution module to Azure Percept DK, with Azure Percept Studio.



  • “Custom Vision” resource – You will create this by following the steps below (same as in the MS tutorial)



In this approach we use the Custom Vision service that Azure offers. This service allows us to connect a Custom Vision project to Azure Percept Studio and capture images with Azure Percept Vision, which automatically makes the images available in the Custom Vision service. After we have captured the necessary images of the fruits we are interested in, either as single snapshots or based on a timer, we create bounding box labels for them. (As the MS tutorial above is very detailed in its steps and will be updated as the service gets updated, we omit the specific steps here.)


After these steps, having labelled our images of fresh fruits, we end up with something like what is presented in the screenshot below.





This part, using the Custom Vision service, is pretty straightforward. Just like I love having a limited number of coffee options to make it easier to pick one in the end, Microsoft, maybe not by choice, presents a limited number of model options to pick from. The basic idea behind the options below is probably to offer quick starting points and fairly general transfer learning options.

You can read more on the domains and model footprint here




Naturally, we chose the object detection project type, and for the domain we picked General (compact), as the model needs to be pushed to Azure Percept DK and provide real-time inferencing.


Training and testing

This section is even easier than the model decision step. We only have to hit the green train button and decide between a quick training option and an “advanced” one, where advanced lets us set how many hours we want to train our model.




No need to get familiarized with numerous hyperparameters and model optimization options. Start the training, and come back after the coffee break, and voila. A finished trained model, on your use case. 


You can run the training multiple times, and each run shows up as a new iteration in your list, indicating a new model. You can toggle the threshold values to see how your model’s performance changes. Note that these figures are based on your training data, so they are not recommended as final model KPIs; use a separate test set for that.
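
To make the point about thresholds and a separate test set more concrete, here is a minimal sketch of how precision and recall could be computed for one held-out image at a chosen probability threshold. It is not part of the Custom Vision service; the function names, the tuple layout of predictions and ground truth, and the IoU cut-off of 0.5 are all assumptions made for illustration.

```python
from typing import List, Tuple

# (left, top, width, height), normalized to the image size
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predictions: List[Tuple[str, float, Box]],  # (label, probability, box)
    ground_truth: List[Tuple[str, Box]],        # (label, box)
    prob_threshold: float = 0.5,
    iou_threshold: float = 0.5,
) -> Tuple[float, float]:
    """Precision and recall for one image at the given thresholds."""
    # Keep only confident predictions, most confident first.
    kept = sorted(
        (p for p in predictions if p[1] >= prob_threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    matched = set()
    true_positives = 0
    for label, _, box in kept:
        for idx, (gt_label, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_label != label:
                continue
            if iou(box, gt_box) >= iou_threshold:
                matched.add(idx)  # each ground-truth box is matched once
                true_positives += 1
                break
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Toy example (made-up values): one spurious "orange" and one low-confidence "banana".
predictions = [
    ("apple", 0.9, (0.1, 0.1, 0.2, 0.2)),
    ("banana", 0.4, (0.5, 0.5, 0.2, 0.2)),
    ("orange", 0.8, (0.7, 0.7, 0.1, 0.1)),
]
ground_truth = [
    ("apple", (0.1, 0.1, 0.2, 0.2)),
    ("banana", (0.5, 0.5, 0.2, 0.2)),
]
print(precision_recall(predictions, ground_truth, prob_threshold=0.5))
print(precision_recall(predictions, ground_truth, prob_threshold=0.3))
```

Lowering the threshold in this toy example recovers the banana (higher recall) while the spurious orange keeps precision below 1, which is the trade-off the threshold slider exposes.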


Model deployment

To finish off the No Code approach, we simply need to follow the step-by-step instructions for selecting which model we want deployed and to which device (Azure Percept DK). After a few point-and-click steps, you can see the video stream of your device, with your custom labels and objects of interest.


Final comments

The No Code capability of Azure Percept is really easy to learn and follow along with. It will help you get an MVP IoT solution with ease! 


The following image is captured from the video feed of Azure Percept Vision.




Retail Self-checkout Object Detection Solution: Low Code Implementation


A low code approach is an option for deploying an object detection model based on an existing dataset that is already fully labelled. This way we don't need to manually label each object of interest. Here we create the fruit detection system with the no-code services available on Azure and train a model based on pre-labelled data, which we load into the Custom Vision service through the Custom Vision client library.

We will need to:

  1. Acquire labelled data from Kaggle (or other sources).
  2. Push labelled data with custom vision client library.
  3. Train an object detection model, with the custom vision service.
  4. Publish the model and download the solution module to Azure Percept DK, with Azure Percept Studio.



To follow along with the Custom Vision library steps, refer to the MS tutorial:



The main point for this approach is to explore the flexibility of Custom Vision and Azure Percept services in accepting already labelled data. 

There is already a huge number of open-source datasets, and we might even have the relevant labelled data ourselves, so why bother labelling new data? Fortunately, Microsoft has enabled loading our own data into its services! To test this capability, we first need data: labelled data of our fruits, bananas, apples and oranges!

A quick search leads us to numerous open-source datasets. We ended up using one from Kaggle. This project contains a web-scraped dataset, i.e., random images from the web of the objects of interest, including bounding box tags.


Instructions on enabling the dataset: Bananas, Apples and Oranges

In order to use this dataset, we need to write some code and utilize the custom vision library! (Low Code part) 


The following code snippets just highlight the differences between Microsoft's tutorial/test code and our code.


  1. We needed to download the dataset. Kaggle requires an account to download its content; however, the account is free.
  2. We provisioned an Azure ML studio and a standard compute instance. (There is no specific need for the ML studio, but it makes the environment setup easier.)
  3. We copied over the template/Quickstart from the Microsoft tutorial above.
  4. Next, we added the account-specific credentials where needed.
  5. We replaced/added the tags with tags specific to our project.
# `project` is assumed to be the project object created in the quickstart code.
banana_tag = trainer.create_tag(project.id, "banana")
apple_tag = trainer.create_tag(project.id, "apple")
orange_tag = trainer.create_tag(project.id, "orange")
  6. Now we need to write some custom code to translate the original bounding box labels so that the Custom Vision service can understand them. We also need to split our data into batches during the upload.


The labels of this specific dataset come in XML format. Additionally, the bounding box format differs from the format Custom Vision accepts: Custom Vision expects normalized bounding boxes in a (left, top, width, height) format, whereas our dataset stores absolute pixel coordinates as (xmin, ymin, xmax, ymax).
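
The translation described above can be sketched in isolation with the standard library. This is a minimal illustration, not the code we ran: the function names are made up for this example, and the XML layout (`size/width`, `object/bndbox/xmin`, and so on) is assumed to follow the Pascal VOC style used by our dataset.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def voc_to_custom_vision(
    xmin: float, ymin: float, xmax: float, ymax: float,
    img_width: float, img_height: float,
) -> Tuple[float, float, float, float]:
    """Convert pixel corners to normalized (left, top, width, height)."""
    if img_width <= 0 or img_height <= 0:
        raise ValueError("image size must be positive to normalize a box")
    left = xmin / img_width
    top = ymin / img_height
    width = (xmax - xmin) / img_width
    height = (ymax - ymin) / img_height
    return left, top, width, height


def regions_from_voc_xml(xml_text: str) -> List[Tuple[str, Tuple[float, float, float, float]]]:
    """Return (label, normalized box) pairs for every object in one annotation."""
    root = ET.fromstring(xml_text)
    img_width = float(root.findtext("size/width"))
    img_height = float(root.findtext("size/height"))
    regions = []
    for obj in root.findall("object"):
        corners = [float(obj.findtext(f"bndbox/{key}"))
                   for key in ("xmin", "ymin", "xmax", "ymax")]
        box = voc_to_custom_vision(*corners, img_width, img_height)
        regions.append((obj.findtext("name"), box))
    return regions


# Hypothetical annotation: a 200x400 image with one apple.
SAMPLE_XML = """<annotation>
  <filename>apple_1.jpg</filename>
  <size><width>200</width><height>400</height></size>
  <object>
    <name>apple</name>
    <bndbox><xmin>50</xmin><ymin>100</ymin><xmax>150</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""

print(regions_from_voc_xml(SAMPLE_XML))  # [('apple', (0.25, 0.25, 0.5, 0.5))]
```

Keeping the conversion in a small pure function makes it easy to check on a handful of annotations before any image is uploaded.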


Custom Vision also has a couple more limitations: (1) it can only handle 64 images per upload call, and (2) it accepts at most 20 tags per upload call.
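
Because of the 64-image limit, the data has to be uploaded in chunks. A minimal sketch of such a chunking helper is shown here; the helper name and the constant are our own for this illustration and are not part of the Custom Vision library.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Upper bound on images per upload call, as described above.
MAX_IMAGES_PER_BATCH = 64


def batches(items: Sequence[T], size: int = MAX_IMAGES_PER_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example: 150 samples are split into batches of 64, 64 and 22.
sample_ids = list(range(150))
print([len(batch) for batch in batches(sample_ids)])  # [64, 64, 22]
```

Each yielded slice can then be turned into one upload request, so a failed upload only affects the images in that slice.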

The code snippet at the end of this list shows the relevant code for this specific dataset, including the translation and the data upload to Custom Vision. The following image displays labelled images in the Custom Vision portal with tags.


  7. Finally, we need to confirm that the data is available on Custom Vision, which it should be. This concludes the coding for the low code approach.
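# Assumes the quickstart setup: `trainer` is the training client and `project`
# is the Custom Vision project (both names assumed from the quickstart).
# `sample[1]` is the path to the XML label file of a sample.
import os

from lxml import objectify
from azure.cognitiveservices.vision.customvision.training.models import (
    ImageFileCreateBatch, ImageFileCreateEntry, Region)

base_image_location = train_dir

banana = 'banana'
apple = 'apple'
orange = 'orange'

# batch(64) available data, decode, normalize and push
print("Adding images...")
batch_size = 64

for n in range(0, len(train_set), batch_size):
    batch = train_set[n:n+batch_size]

    tagged_images_with_regions = []

    # for every image in each batch read xml, decode and normalize
    for sample in batch:
        # read as bytes so lxml can handle an XML encoding declaration
        with open(sample[1], 'rb') as f:
            data = f.read()

        annotation = objectify.fromstring(data)

        file_name = str(annotation.filename).split('.')[0]
        img_width = float(annotation.size.width)
        img_height = float(annotation.size.height)

        # assumed handling: skip samples whose size is missing, as they cannot be normalized
        if img_width == 0 or img_height == 0:
            continue

        regions = []
        for i in range(len(annotation.object)):
            # pixel corners (xmin, ymin, xmax, ymax) -> (left, top, width, height)
            x = float(annotation.object[i].bndbox.xmin)
            y = float(annotation.object[i].bndbox.ymin)
            w = float(annotation.object[i].bndbox.xmax) - x
            h = float(annotation.object[i].bndbox.ymax) - y

            # normalize to the [0, 1] range expected by Custom Vision
            x = x / img_width
            y = y / img_height
            w = w / img_width
            h = h / img_height

            if annotation.object[i].name == banana:
                regions.append(Region(tag_id=banana_tag.id, left=x, top=y, width=w, height=h))
            elif annotation.object[i].name == apple:
                regions.append(Region(tag_id=apple_tag.id, left=x, top=y, width=w, height=h))
            else:  # orange
                regions.append(Region(tag_id=orange_tag.id, left=x, top=y, width=w, height=h))

        with open(os.path.join(base_image_location, file_name + '.jpg'), mode='rb') as image_contents:
            tagged_images_with_regions.append(ImageFileCreateEntry(name=file_name, contents=image_contents.read(), regions=regions))

    # upload once per batch, after all samples of the batch have been collected
    upload_result = trainer.create_images_from_files(project.id, ImageFileCreateBatch(images=tagged_images_with_regions))
    if not upload_result.is_batch_successful:
        print("Image batch upload failed.")
        for image in upload_result.images:
            print("Image status: ", image.status)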






Same steps as for the No Code approach. 


Training and testing

Same steps as for the No Code approach. 


Model deployment

Same steps as for the No Code approach. 


Final comments

Even though the capability of pushing labelled data is hidden behind a library, it is really good that Microsoft has this capability! This increases the value of this service tremendously!


The following image (source: a random image found on Google) highlights the model’s ability to detect objects of interest after being trained on pre-labelled data.




Retail Self-checkout Object Detection Solution: Pure Code



Finally, a user may want a fully custom approach to data acquisition/labelling and model training/analysis. In this case, one might only be interested in deploying the final model to Azure Percept DK. This is also supported. Here, we implement the fruit detection system by training a custom model within Azure ML studio (you can choose your preferred platform) and configuring the required containerization files to enable deployment on the Azure Percept DK.


We will need to:

  1. Acquire labelled data from COCO (or other sources).
  2. Train an object detection model, on Azure ML studio (or other platforms).
  3. Publish the model and download the solution module to Azure Percept DK, with Azure Percept SDK.

The aim of this approach was to deploy a custom object detection model to Azure Percept DK through the Module Twin update feature. Through this approach, we end up with a broader range of models and set-ups to choose from, while at the same time having more control over the process of going from a use case to an end-to-end object detection system deployed on the edge.


We used some of the available online resources from the Azure Percept team in the following solution. They are located here:


In this specific case, we have been using one of Azure Percept’s own tutorials, which contains all the steps from data acquisition to model building and training and lastly, model deployment. The particular notebook we used as inspiration can be found here:





The dataset we use for this approach is a subset of the publicly available image dataset COCO, where we filter out the redundant classes, leaving us with images of bananas, apples and oranges in various settings. The dataset contains around 4500 images of these three classes, along with their respective bounding boxes and labels. For more information about the COCO dataset, visit their website:
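
The class filtering described above can be sketched as follows. This is an illustration under assumptions, not the notebook code we used: the function name is hypothetical, and the input is assumed to follow the standard COCO annotation layout with `images`, `annotations` and `categories` lists. The toy data below is made up.

```python
from typing import Dict, List, Set


def filter_coco(coco: Dict[str, List[dict]], keep_names: Set[str]) -> Dict[str, List[dict]]:
    """Keep only the given categories, their annotations, and the images that contain them."""
    keep_ids = {c["id"] for c in coco["categories"] if c["name"] in keep_names}
    annotations = [a for a in coco["annotations"] if a["category_id"] in keep_ids]
    image_ids = {a["image_id"] for a in annotations}
    return {
        "images": [img for img in coco["images"] if img["id"] in image_ids],
        "annotations": annotations,
        "categories": [c for c in coco["categories"] if c["id"] in keep_ids],
    }


# Toy annotation file: image 2 only contains a class we do not need.
toy_coco = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 52, "bbox": [10, 20, 30, 40]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [0, 0, 50, 80]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [5, 5, 20, 20]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 52, "name": "banana"}],
}

subset = filter_coco(toy_coco, {"banana", "apple", "orange"})
print(len(subset["images"]), len(subset["annotations"]))  # 1 1
```

In practice the input dictionary would come from loading one of the COCO annotation JSON files, and the filtered result would be written back to disk for training.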



We chose to go with a TensorFlow version of the SSD-MobileNet model for this task, due to its limited footprint and hence its suitability for edge deployment, while still providing solid performance in object detection tasks. SSD (short for Single-Shot Detector) is a popular object detection architecture in scenarios where inference speed and model footprint are of high priority. The key feature of this type of architecture is its ability to produce bounding box estimates straight away, instead of first having to produce proposals for possible bounding boxes. Additionally, its backbone network (the feature extraction part) is completely independent, meaning that it is replaceable. This enables us to use a model architecture like MobileNet for this purpose, a relatively small and lightweight image classification architecture that is well suited for our needs in this task.


Training and testing

To ease our efforts slightly, we start from a pre-trained model for this task. The model is then trained further on our dataset for around 30000 epochs to obtain a fair level of accuracy at inference.


After training, we test our model performance on a couple of test images. Below we display two model outputs: 






Model deployment

After the model is trained, we convert it to the OpenVINO IR (Intermediate Representation) format that Azure Percept DK requires. OpenVINO is Intel’s open-source toolkit for optimizing and deploying AI models on Intel hardware, such as the Intel Movidius Myriad X (MA2085) VPU on Azure Percept Vision. In short, models from frameworks like TensorFlow are converted to the IR format, and the model graph is further optimized so that the inference latency and general footprint are greatly reduced. For more information about OpenVINO and their toolkits, visit


After the model is converted, we upload it to Azure Storage in the form of a blob before the model is finally replaced in the Azureeyemodule through the Module Twin update feature. Essentially, the only thing that needs to be changed is where the module looks for its detection model. We thus update the module with a link to where we stored our model. 
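
As a rough illustration of what such a Module Twin update carries, the sketch below builds the desired-properties patch as a plain dictionary. The property name `ModelZipUrl` is our assumption about the setting the azureeyemodule reads, so please verify it against the current module documentation; the helper name and the blob URL are hypothetical placeholders.

```python
import json


def build_model_update_patch(model_zip_url: str) -> dict:
    """Build a module twin patch that points the module at a new model archive."""
    if not model_zip_url.startswith("https://"):
        raise ValueError("expected an https URL to the uploaded model blob")
    # "ModelZipUrl" is an assumed property name, see the note above.
    return {"properties": {"desired": {"ModelZipUrl": model_zip_url}}}


# Hypothetical placeholder URL; replace with the link to your own blob (e.g. with a SAS token).
patch = build_model_update_patch(
    "https://<storage-account>.blob.core.windows.net/<container>/model.zip"
)
print(json.dumps(patch, indent=2))
```

The resulting JSON is what would be applied to the module twin, for example through the portal or an IoT Hub SDK; only the model location changes, the rest of the module configuration stays untouched.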



After this is done and the module is updated, we start the camera stream with Azure Percept up and running, and we should see our model's inferencing outputs.


Final comments

Overall, this approach provides a highly customizable way of deploying a deep learning framework of our own choosing to Azure Percept. The Module Twin Update method enables a fast and simple model deployment to the device and with the Azureeyemodule, real-time inference is seamlessly integrated into Azure Percept DK.  


Closing remarks


Azure Percept Development Kit and the Azure Percept Vision module, along with the Custom Vision tech stack, are a really powerful toolset, enabling just about anyone, no matter their skill level, to create an intelligent vision solution. This is backed up by the fact that all three implementations of the Retail Self-checkout Object Detection Solution outlined above (no code, low code and pure code) were completed within one week. The documentation and intuitive implementation of the tech stack have also allowed us to quickly skill up several teammates.


Since the Cognizant team's participation in the Microsoft Azure Percept Bootcamp, we have used the tech stack in a number of POCs and engagements. It has in particular accelerated our real-time decision-making offerings. We are also very fond of the level of integration and the comprehensiveness of these services, which enable us to create simple iterations of the use case earlier in the engagement and thus to capture the audience and build trust faster. We encourage you to try Azure Percept and deploy your model with a single click.


Resources for learning more about Azure Percept


Version history
Last update: Mar 15 2022 11:16 AM