Pose estimation refers to computer vision techniques used to detect human postures in images and videos. The technology does not identify who is in the image; rather, it locates key-points on the body of the person in the image.







Posenet can detect either a single pose or multiple poses of the subjects present in an image; the single-pose detector is faster and simpler than its multi-pose counterpart. For the sake of simplicity, the working demo is configured for single-pose detection. At a high level, pose estimation consists of two parts:

  1. An input RGB image is fed through a CNN.
  2. An algorithm is used to decode four outputs: the posers, the poser confidence scores, the coordinates of the key-points, and the corresponding key-point confidence scores.
What is a poser?

A poser is a subject of the image that has a humanoid pose: humans and human-like figures can both be considered posers. Before the posture of a subject can be detected, the subject itself has to be identified. A confidence score between 0.0 and 1.0 can be used as a threshold to filter out poses that are not deemed strong enough.

What is a key-point?

Key-points are markers on the human body that can be used to determine its posture. Posenet detects 17 such key-points: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, and so on. Posenet estimates in two dimensions, so only the x and y coordinates of each key-point are returned. A confidence score between 0.0 and 1.0 can be employed here as well.
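To make the output concrete, here is a sketch of the kind of result a single-pose estimate produces, with a confidence threshold applied as described above. The field names below are illustrative, not the exact library API:

```python
# Illustrative shape of a single-pose result: one overall poser score
# plus one entry per key-point with its position and confidence.
pose = {
    "score": 0.86,  # poser confidence, 0.0 to 1.0
    "keypoints": [
        {"part": "nose",    "position": {"x": 253.4, "y": 76.9}, "score": 0.99},
        {"part": "leftEye", "position": {"x": 246.1, "y": 71.2}, "score": 0.98},
        # ... 15 more entries, one per key-point
    ],
}

# Use the per-key-point confidence score as a threshold to discard
# key-points that are not deemed strong enough.
confident = [kp for kp in pose["keypoints"] if kp["score"] >= 0.5]
```
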



Fitness Training

From TomTom’s Spark to Jabra’s Elite Sport, to all manner of Fitbit and Garmin wearables, there are already gadgets that attempt to tell you how fit you are. With AI, this fitness gauging should become far more sophisticated, flexible, and useful at work, for instance by prompting individuals to maintain the right body posture throughout the day.

Sports Analytics

In sports, AI can be used to evaluate exercises performed during training or to analyse on-field body movements. These measurements can feed into intelligent, optimised training methods that automatically assess exercise technique, investigate the quality of execution, and provide athletes and coaches with appropriate feedback.


This module should be used when only one human or human-like figure forms the subject of the image. The inputs of the single-pose detector are the following:

  1. An image, which must be square in shape.
  2. A scale factor in the range 0.2 to 1.0, used to scale down the image before feeding it through the network. A smaller value means a smaller scaled-down version of the original input is fed to the CNN, which increases processing speed at the cost of accuracy.
  3. Image mirror, a flag that horizontally flips the image. This can be used to restore the true orientation of an image captured from a webcam or front camera.
  4. Output stride: defaults to 16, and must be 8, 16, or 32. A larger stride implies faster but less accurate processing of the image. When the stride is not 32, atrous convolution is used in the subsequent layers to produce a higher-resolution output.
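The constraints on these inputs can be sketched as a small validation helper. This is a hypothetical function for illustration (`preprocess_config` is not part of any library API); it checks the documented ranges and reports the effective CNN input size:

```python
def preprocess_config(image_size, scale_factor=0.5,
                      flip_horizontal=False, output_stride=16):
    """Validate single-pose detector inputs (illustrative sketch).

    image_size: side length of the square input image, in pixels.
    """
    if output_stride not in (8, 16, 32):
        raise ValueError("output_stride must be 8, 16, or 32")
    if not 0.2 <= scale_factor <= 1.0:
        raise ValueError("scale_factor must lie in [0.2, 1.0]")
    # A smaller scale factor means a smaller image reaches the CNN:
    # faster processing at the cost of accuracy.
    scaled = int(image_size * scale_factor)
    return {"input_size": scaled,
            "flip": flip_horizontal,
            "stride": output_stride}
```

For example, a 513-pixel square image at a scale factor of 0.5 would enter the network at roughly 256 pixels per side.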
Posenet Architecture: Single-Pose Detection Algorithm

At a high level, Posenet generates a heatmap that gives the probability, or confidence score, of a key-point being present in each region of the image. When Posenet processes an image, this heatmap, together with a set of offset vectors, is decoded into the high-confidence key-point locations. The heatmap is a 3D tensor whose spatial resolution is given by the formula: Resolution = ((Input_size - 1)/output_stride) + 1
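The resolution formula can be checked with a couple of lines of Python (`heatmap_resolution` is just an illustrative helper, not a library function):

```python
def heatmap_resolution(input_size: int, output_stride: int) -> int:
    """Spatial resolution of the heatmap for a square input image:
    ((input_size - 1) / output_stride) + 1."""
    return (input_size - 1) // output_stride + 1

# A 513x513 input with the default stride of 16 yields a 33x33 heatmap:
# (513 - 1) / 16 + 1 = 33
print(heatmap_resolution(513, 16))  # -> 33
```

Note how a smaller output stride gives a larger, finer-grained heatmap: the same 513-pixel input at stride 8 produces a 65 x 65 grid.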

The offset vector is a 3D tensor of dimension (resolution x resolution x 34), the depth of the tensor being twice the number of key-points. Since the heatmap is an approximation of where the key-points are, the offset vectors corresponding to the key-points given by the heatmap are used to offset the prediction to the exact location in the image.

Posenet Architecture: Heatmap and Offset Vector Simplification

Visualization of the heatmap and offset-vector tensors. The depth of the heatmap tensor is 17, since 17 key-points are to be located; its dimension is therefore (resolution x resolution x 17). Posenet uses the MobileNet architecture to carry out this task; the image below shows the trade-off between speed and accuracy.

Posenet Architecture: Output Stride and Heatmap Resolution

So once the heatmap is generated, how do we decode it to get the poses?

  1. A sigmoid activation is applied to the heatmap to convert the raw scores into confidence values between 0 and 1.
    scores = heatmap.sigmoid()
  2. An argmax2D is applied to the key-point confidence scores to find the x and y position with the maximum score for each key-point.
    heatmapPositions = scores.argmax(y, x)
  3. For each x and y pair, the corresponding offset pair is looked up. With 17 key-points there are 17 x 2 = 34 offset elements in total.
    offsetVector = [offsets.get(y, x, k), offsets.get(y, x, 17 + k)]
  4. Finally, the key-point positions are obtained by scaling the heatmap positions by the output stride and adding the offsets.
    keypointPositions = heatmapPositions * outputStride + offsetVectors
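The four steps above can be sketched in pure Python. This is a minimal illustration of the decoding logic, not the library implementation; it assumes the heatmap is a `res x res x K` nested list of raw scores and the offsets a `res x res x 2K` list whose first K channels hold y-offsets and last K hold x-offsets:

```python
import math

def decode_keypoints(heatmap, offsets, output_stride=16):
    """Decode key-point image coordinates from a heatmap and offsets.

    Returns a list of (y, x, score) tuples, one per key-point.
    """
    res = len(heatmap)
    num_keypoints = len(heatmap[0][0])
    keypoints = []
    for k in range(num_keypoints):
        # Step 2: argmax over the spatial grid for key-point k.
        best_y, best_x = max(
            ((y, x) for y in range(res) for x in range(res)),
            key=lambda p: heatmap[p[0]][p[1]][k],
        )
        # Step 1: sigmoid turns the raw score into a confidence in [0, 1].
        score = 1.0 / (1.0 + math.exp(-heatmap[best_y][best_x][k]))
        # Steps 3-4: stride across the grid and add the offset vector.
        y = best_y * output_stride + offsets[best_y][best_x][k]
        x = best_x * output_stride + offsets[best_y][best_x][num_keypoints + k]
        keypoints.append((y, x, score))
    return keypoints

# Toy example: one key-point on a 2x2 grid.
heatmap = [[[0.0], [0.0]], [[0.0], [3.0]]]      # peak at grid cell (1, 1)
offsets = [[[0.0, 0.0], [0.0, 0.0]],
           [[0.0, 0.0], [2.0, 5.0]]]            # y-offset 2, x-offset 5
print(decode_keypoints(heatmap, offsets, output_stride=16))
# key-point lands at y = 1*16 + 2 = 18, x = 1*16 + 5 = 21
```

In a real run the argmax is computed as a batched tensor operation rather than a Python loop, but the arithmetic is the same.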




Build this in Big Brain in just 4 steps within 30 minutes:

  1. Click on DL Model Designer and drag & drop layers/operations.
  2. Configure the parameters of each layer.
  3. Validate the model architecture.
  4. Train it on your datasets.
Builder Parameters

