Microsoft Foundry Blog

8 MIN READ

Ombromanie: Creating Hand Shadow stories with Azure Speech and TensorFlow.js Handposes

Former Employee

Feb 25, 2021

Have you ever tried to cast hand shadows on a wall? It is the easiest thing in the world, and yet to do it well requires practice and just the right setup. To cultivate your #cottagecore aesthetic, try going into a completely dark room with just one lit candle, and casting hand shadows on a plain wall. The effect is startlingly dramatic. What fun!

Even a tea light suffices to create a great effect

In 2020, and now into 2021, many folks are reverting back to basics as they look around their houses, reopening dusty corners of attics and basements and remembering the simple crafts that they used to love. Papermaking, anyone? All you need is a few tools and torn up, recycled paper. Pressing flowers? All you need is newspaper, some heavy books, and patience. And hand shadows? Just a candle.

This TikTok creator has thousands of views for their handshadow tutorials

But what's a developer to do when trying to capture that #cottagecore vibe in a web app?

High Tech for the Cottage

While exploring the art of hand shadows, I wondered whether some of the recent work I had done for body poses might be applicable to hand poses. What if you could tell a story on the web using your hands, and somehow save a video of the show and the narrative behind it, and send it to someone special? In lockdown, what could be more amusing than sharing shadow stories between friends or relatives, all virtually?

Hand shadow casting is a folk art probably originating in China; if you go to tea houses with stage shows, you might be lucky enough to view one like this!

A Show Of Hands

When you start researching hand poses, it's striking how much content there is on the web on the topic. There has been work since at least 2014 on creating fully articulated hands within the research, simulation, and gaming sphere:

MSR throwing hands

There are dozens of handpose libraries already on GitHub:

There are many applications where tracking hands is a useful activity:

• Gaming
• Simulations / Training
• "Hands free" uses for remote interactions with things by moving the body
• Assistive technologies
• TikTok effects :trophy:
• Useful things like Accordion Hands apps

One of the more interesting new libraries, handsfree.js, offers an excellent array of demos in its effort to move to a hands free web experience:

Handsfree.js, a very promising project

As it turns out, hands are pretty complicated things. They each include 21 keypoints (vs PoseNet's 17 keypoints for an entire body). Building a model to support inference for such a complicated grouping of keypoints has provenn challenging.

There are two main libraries available to the web developer when incorporating hand poses into an app: TensorFlow.js's handposes, and MediaPipe's. HandsFree.js uses both, to the extent that they expose APIs. As it turns out, neither TensorFlow.js nor MediaPipe's handposes are perfect for our project. We will have to compromise.

TensorFlow.js's handposes allow access to each hand keypoint and the ability to draw the hand to canvas as desired. HOWEVER, it only currently supports single hand poses, which is not optimal for good hand shadow shows.
MediaPipe's handpose models (which are used by TensorFlow.js) do allow for dual hands BUT its API does not allow for much styling of the keypoints so that drawing shadows using it is not obvious.

One other library, fingerposes, is optimized for finger spelling in a sign language context and is worth a look.

Since it's more important to use the Canvas API to draw custom shadows, we are obliged to use TensorFlow.js, hoping that either it will soon support multiple hands OR handsfree.js helps push the envelope to expose a more styleable hand.

Let's get to work to build this app.

Scaffold a Static Web App

As a Vue.js developer, I always use the Vue CLI to scaffold an app using vue create my-app and creating a standard app. I set up a basic app with two routes: Home and Show. Since this is going to be deployed as an Azure Static Web App, I follow my standard practice of including my app files in a folder named app and creating an api folder to include an Azure function to store a key (more on this in a minute).

In my package.json file, I import the important packages for using TensorFlow.js and the Cognitive Services Speech SDK in this app. Note that TensorFlow.js has divided its imports into individual packages:

"@tensorflow-models/handpose": "^0.0.6",
"@tensorflow/tfjs": "^2.7.0",
"@tensorflow/tfjs-backend-cpu": "^2.7.0",
"@tensorflow/tfjs-backend-webgl": "^2.7.0",
"@tensorflow/tfjs-converter": "^2.7.0",
"@tensorflow/tfjs-core": "^2.7.0",
...
"microsoft-cognitiveservices-speech-sdk": "^1.15.0",

Set up the View

We will draw an image of a hand, as detected by TensorFlow.js, onto a canvas, superimposed onto a video suppled by a webcam. In addition, we will redraw the hand to a second canvas (shadowCanvas), styled like shadows:

<div id="canvas-wrapper column is-half">
<canvas id="output" ref="output"></canvas>
    <video
        id="video"
        ref="video"
        playsinline
        style="
          -webkit-transform: scaleX(-1);
           transform: scaleX(-1);
           visibility: hidden;
           width: auto;
           height: auto;
           position: absolute;
         "
    ></video>
 </div>
 <div class="column is-half">
    <canvas
       class="has-background-black-bis"
       id="shadowCanvas"
       ref="shadowCanvas"
     >
    </canvas>
</div>

Load the Model, Start Keyframe Input

Working asynchronously, load the Handpose model. Once the backend is setup and the model is loaded, load the video via the webcam, and start watching the video's keyframes for hand poses. It's important at these steps to ensure error handling in case the model fails to load or there's no webcam available.

async mounted() {
    await tf.setBackend(this.backend);
    //async load model, then load video, then pass it to start landmarking
    this.model = await handpose.load();
    this.message = "Model is loaded! Now loading video";
    let webcam;
    try {
      webcam = await this.loadVideo();
    } catch (e) {
      this.message = e.message;
      throw e;
    }

    this.landmarksRealTime(webcam);
  },

Setup the Webcam

Still working asynchronously, set up the camera to provide a stream of images

async setupCamera() {
      if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
        throw new Error(
          "Browser API navigator.mediaDevices.getUserMedia not available"
        );
      }
      this.video = this.$refs.video;
      const stream = await navigator.mediaDevices.getUserMedia({
        video: {
          facingMode: "user",
          width: VIDEO_WIDTH,
          height: VIDEO_HEIGHT,
        },
      });

      return new Promise((resolve) => {
        this.video.srcObject = stream;
        this.video.onloadedmetadata = () => {
          resolve(this.video);
        };
      });
    },

Design a Hand to Mirror the Webcam's

Now the fun begins, as you can get creative in drawing the hand on top of the video. This landmarking function runs on every keyframe, watching for a hand to be detected and drawing lines onto the canvas - red on top of the video, and black on top of the shadowCanvas. Since the shadowCanvas background is white, the hand is drawn as white as well and the viewer only sees the offset shadow, in fuzzy black with rounded corners. The effect is rather spooky!

async landmarksRealTime(video) {
      //start showing landmarks
      this.videoWidth = video.videoWidth;
      this.videoHeight = video.videoHeight;

      //set up skeleton canvas
      this.canvas = this.$refs.output;
      ...

      //set up shadowCanvas
      this.shadowCanvas = this.$refs.shadowCanvas;
      ...

      this.ctx = this.canvas.getContext("2d");
      this.sctx = this.shadowCanvas.getContext("2d");

      ...

      //paint to main

      this.ctx.clearRect(0, 0, this.videoWidth, 
  this.videoHeight);
      this.ctx.strokeStyle = "red";
      this.ctx.fillStyle = "red";
      this.ctx.translate(this.shadowCanvas.width, 0);
      this.ctx.scale(-1, 1);

      //paint to shadow box

      this.sctx.clearRect(0, 0, this.videoWidth, this.videoHeight);
      this.sctx.shadowColor = "black";
      this.sctx.shadowBlur = 20;
      this.sctx.shadowOffsetX = 150;
      this.sctx.shadowOffsetY = 150;
      this.sctx.lineWidth = 20;
      this.sctx.lineCap = "round";
      this.sctx.fillStyle = "white";
      this.sctx.strokeStyle = "white";

      this.sctx.translate(this.shadowCanvas.width, 0);
      this.sctx.scale(-1, 1);

      //now you've set up the canvases, now you can frame its landmarks
      this.frameLandmarks();
    },

For Each Frame, Draw Keypoints

As the keyframes progress, the model predict new keypoints for each of the hand's elements, and both canvases are cleared and redrawn.

      const predictions = await this.model.estimateHands(this.video);

      if (predictions.length > 0) {
        const result = predictions[0].landmarks;
        this.drawKeypoints(
          this.ctx,
          this.sctx,
          result,
          predictions[0].annotations
        );
      }
      requestAnimationFrame(this.frameLandmarks);

Draw a Lifelike Hand

Since TensorFlow.js allows you direct access to the keypoints of the hand and the hand's coordinates, you can manipulate them to draw a more lifelike hand. Thus we can redraw the palm to be a polygon, rather than resembling a garden rake with points culminating in the wrist.

Re-identify the fingers and palm:

     fingerLookupIndices: {
        thumb: [0, 1, 2, 3, 4],
        indexFinger: [0, 5, 6, 7, 8],
        middleFinger: [0, 9, 10, 11, 12],
        ringFinger: [0, 13, 14, 15, 16],
        pinky: [0, 17, 18, 19, 20],
      },
      palmLookupIndices: {
        palm: [0, 1, 5, 9, 13, 17, 0, 1],
      },

...and draw them to screen:

    const fingers = Object.keys(this.fingerLookupIndices);
      for (let i = 0; i < fingers.length; i++) {
        const finger = fingers[i];
        const points = this.fingerLookupIndices[finger].map(
          (idx) => keypoints[idx]
        );
        this.drawPath(ctx, sctx, points, false);
      }
      const palmArea = Object.keys(this.palmLookupIndices);
      for (let i = 0; i < palmArea.length; i++) {
        const palm = palmArea[i];
        const points = this.palmLookupIndices[palm].map(
          (idx) => keypoints[idx]
        );
        this.drawPath(ctx, sctx, points, true);
      }

With the models and video loaded, keyframes tracked, and hands and shadows drawn to canvas, we can implement a speech-to-text SDK so that you can narrate and save your shadow story.

To do this, get a key from the Azure portal for Speech Services by creating a Service:

You can connect to this service by importing the sdk:

import * as sdk from "microsoft-cognitiveservices-speech-sdk";

...and start audio transcription after obtaining an API key which is stored in an Azure function in the /api folder. This function gets the key stored in the Azure portal in the Azure Static Web App where the app is hosted.

async startAudioTranscription() {
      try {
        //get the key
        const response = await axios.get("/api/getKey");
        this.subKey = response.data;
        //sdk

        let speechConfig = sdk.SpeechConfig.fromSubscription(
          this.subKey,
          "eastus"
        );
        let audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
        this.recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

        this.recognizer.recognized = (s, e) => {
          this.text = e.result.text;
          this.story.push(this.text);
        };

        this.recognizer.startContinuousRecognitionAsync();
      } catch (error) {
        this.message = error;
      }
    },

In this function, the SpeechRecognizer gathers text in chunks that it recognizes and organizes into sentences. That text is printed into a message string and displayed on the front end.

Display the Story

In this last part, the output cast onto the shadowCanvas is saved as a stream and recorded using the MediaRecorder API:

const stream = this.shadowCanvas.captureStream(60); // 60 FPS recording
      this.recorder = new MediaRecorder(stream, {
        mimeType: "video/webm;codecs=vp9",
      });
      (this.recorder.ondataavailable = (e) => {
        this.chunks.push(e.data);
      }),
        this.recorder.start(500);

...and displayed below as a video with the storyline in a new div:

      const video = document.createElement("video");
      const fullBlob = new Blob(this.chunks);
      const downloadUrl = window.URL.createObjectURL(fullBlob);
      video.src = downloadUrl;
      document.getElementById("story").appendChild(video);
      video.autoplay = true;
      video.controls = true;

This app can be deployed as an Azure Static Web App using the excellent Azure plugin for Visual Studio Code. And once it's live, you can tell durable shadow stories!