Case Study: Counting People in Motion

Counting is one of the basic tasks of computer vision (CV). CV uses image processing algorithms to make visual sense of the world, counting anything from cars and signs to products and people. At StreamLogic, people are what we are most often interested in counting, because it helps us understand customer demographics and behavior. We count people to measure attendance at events, to determine capacity usage of rooms or public transportation, to analyze shopping habits, and more.

The Project

Counting individuals can be tricky because, most of the time, people are in motion. That's why, for this case study, we used the StreamLogic platform to create a solution for counting the unique individuals passing by a street camera, as recorded in this sample video. The solution demonstrates the use of deep learning models, object tracking, and unsupervised learning. The key challenges of the application were detecting people across frames and identifying multiple appearances of the same person.

Use Cases


StreamLogic counts people to determine optimal capacity, helping prevent overcrowding and reduce fire hazards. The technology can also help identify when a person is somewhere they should not be. Both capabilities help alleviate safety concerns.


Counting people also provides insight into day-to-day, week-to-week, and even seasonal business trends. If you need to know how many resources to allocate at a given time, an accurate count of people helps you anticipate business needs and make sounder decisions.

The Process



Detecting People

Our first step is to determine how to identify people within an image. To accomplish this, we chose to use face detection rather than detecting the whole body. In many applications, the body is only partially visible and faces are more often completely in view. Face detection has also been studied for longer and has become widespread in the last decade, making the technology more reliable.

With StreamLogic, we can choose between face detection algorithms based on traditional image processing (a feature descriptor called the histogram of oriented gradients, or HOG) or on more recent deep neural networks (DNNs). Both approaches use machine learning to train a model that takes an image as input and outputs a list of bounding boxes around the regions of the image that contain a human face. Since pre-trained models are readily available, we chose a pre-trained DNN model for this project.



Detecting People Across Frames

With StreamLogic’s face detection algorithm, we can easily count the number of faces in a single image. However, this introduces a new challenge: how do we extend this capability to video?

Since video is just a sequence of individual frames, we can apply the face detection algorithm to each frame. And because we are counting people, we only need to find faces that have not already been counted in previous frames. One simple way to do this is to keep the locations of the faces from the previous frame, detect faces in the new frame, and discard any new face that overlaps a previous face by more than some threshold.
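As a sketch of that overlap test, here is a minimal Python implementation using intersection-over-union (IoU). The box format, threshold value, and function names are illustrative, not part of the StreamLogic API:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def new_faces(current, previous, threshold=0.5):
    """Keep only detections that do not overlap any previous-frame face."""
    return [box for box in current
            if all(iou(box, prev) < threshold for prev in previous)]
```

A face detected near the same spot as a previous one is treated as already counted; everything else is considered new.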


While this approach provides an estimate of the number of people, it has a couple of key issues. In a crowded scene, overlap may not be a good indication of which faces are the same across frames, which can lead to undercounting. Another issue is that face detection is computationally expensive, so it is preferable to avoid applying it to every frame.

Solution: Object Tracking

A more sophisticated approach to detecting people across frames is to employ object tracking. Object tracking algorithms are developed specifically for tracking unique objects from one frame to the next in a video. These algorithms can run faster than face detection and are based not just on the object's location, but also on its visual appearance. That's why StreamLogic employs the tracking algorithm available in the dlib C++ library. This algorithm tracks an object by searching for it in an area around its location in the previous frame. Not only can it handle objects moving between frames, it can also follow objects that grow or shrink in the image as they move closer or farther away.
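To make the search-window idea concrete, here is a toy Python sketch. The integer-grid "frames", sum-of-squared-differences scoring, and fixed search radius are deliberate simplifications; a real correlation tracker like dlib's also adapts its model and handles scale changes:

```python
def track(frame, template, prev_pos, search_radius=2):
    """Find the position near prev_pos whose patch best matches template.

    frame: 2D list of pixel intensities; template: smaller 2D list;
    prev_pos: (row, col) of the object in the previous frame.
    """
    h, w = len(template), len(template[0])
    best, best_score = prev_pos, float("inf")
    py, px = prev_pos
    for y in range(max(0, py - search_radius), py + search_radius + 1):
        for x in range(max(0, px - search_radius), px + search_radius + 1):
            if y + h > len(frame) or x + w > len(frame[0]):
                continue  # patch would fall outside the frame
            score = sum((frame[y + i][x + j] - template[i][j]) ** 2
                        for i in range(h) for j in range(w))
            if score < best_score:
                best, best_score = (y, x), score
    return best
```

Because the search is restricted to a small window around the previous location, each update is far cheaper than re-running a detector over the whole frame.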



Algorithm I

We combined the face detection method with object tracking and created the following algorithm to count the people moving past the street camera in the sample video:


  1. Initialize the count and the object tracker.
  2. For each video frame:
     1. Update the object tracker with the new frame.
     2. Every Nth video frame:
        1. Run face detection.
        2. Match detected faces with the locations of objects already being tracked.
        3. Increment the count for each new face.
        4. Add each new face to the object tracker.
        5. Remove unmatched objects from the tracker.
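The steps above can be sketched in Python. Everything here is a stand-in: the tracker is a toy that simply remembers boxes rather than re-locating them, detection is an arbitrary callable, and the final removal step is omitted for brevity:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

class SimpleTracker:
    """Toy stand-in for a dlib-style tracker: it only remembers boxes."""
    def __init__(self):
        self.tracked = []
    def update(self, frame):
        pass  # a real tracker would re-locate each box in the new frame
    def add(self, box):
        self.tracked.append(box)

def count_people(frames, detect_faces, every_n=2, overlap=0.5):
    """Sketch of Algorithm I: detect every Nth frame, count only new faces."""
    tracker, count = SimpleTracker(), 0
    for i, frame in enumerate(frames):
        tracker.update(frame)                  # step 2.1
        if i % every_n == 0:                   # step 2.2
            detections = detect_faces(frame)
            new = [d for d in detections       # steps 2.2.1-2.2.2
                   if all(iou(d, t) < overlap for t in tracker.tracked)]
            count += len(new)                  # step 2.2.3
            for box in new:
                tracker.add(box)               # step 2.2.4
            # step 2.2.5 (removing unmatched objects) is omitted in this toy
    return count
```

The key structure is that the cheap tracker update runs on every frame, while the expensive detector only runs on every Nth frame.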


This algorithm is a great start when it comes to counting people, but it still suffers from one major problem: as people move through the video, particularly in a crowd, faces may be occluded by other people or objects. For example, in the clip below, the man in the middle is initially visible, but the woman in front of him blocks him as she passes.

As a result, his face is not detected during those frames. When the man's face reappears once the woman has moved ahead, it is counted again as a new face. Consequently, Algorithm I will frequently overcount people. To illustrate: there are 8 unique people in the sample clip, and this algorithm counts 13.

Solution: Deduplication Via Clustering

To combat this overcounting, we need a way to determine whether two face images are of the same person. Fortunately, there are a number of ways to compare two images for similarity. For face images specifically, Carnegie Mellon University (CMU) has developed a model called OpenFace. The CMU model takes a face image as input and produces a numeric signature for the face. As with handwritten signatures, the OpenFace "signature" for two images of the same person's face is not exactly the same, but the two can be compared for similarity.
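Since the signature is just a numeric vector, comparing two of them can be as simple as measuring the distance between the vectors. The tiny 3-element vectors and the threshold below are made up for illustration; a real OpenFace signature has many more dimensions:

```python
import math

def distance(sig_a, sig_b):
    """Euclidean distance between two face signatures (numeric vectors)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))

def same_person(sig_a, sig_b, threshold=0.6):
    """Treat two faces as the same person if their signatures are close."""
    return distance(sig_a, sig_b) < threshold
```

Two images of the same person should produce nearby vectors; images of different people should land far apart.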

We could have incorporated face similarity into Algorithm I directly, comparing each new face to previously seen faces as the video plays. Instead, with the StreamLogic platform, we decided to split the problem into two stages:

  1. Use Algorithm I to collect faces as they appear in the video.
  2. Deduplicate the collected faces once, using the OpenFace similarity measure.

The deduplication stage is based on clustering, a machine learning technique for grouping a set of things based on how similar they are. Clustering can be applied to any set of objects, as long as you have a function that computes the similarity between two of them. OpenFace gives us exactly that for face images, because its output is a high-dimensional numeric vector.
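As a minimal sketch of the deduplication idea, here is a greedy threshold-based grouping in pure Python. The toy 2-D vectors and the threshold are illustrative, and this greedy scheme is a simplification standing in for whatever clustering algorithm the platform actually uses:

```python
import math

def euclidean(a, b):
    """Distance between two signature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_faces(signatures, threshold=0.6):
    """Greedily group face signatures: each face joins the first cluster
    whose representative is within `threshold`, else starts a new cluster."""
    clusters = []  # each cluster is a list of signatures
    for sig in signatures:
        for group in clusters:
            if euclidean(sig, group[0]) < threshold:
                group.append(sig)
                break
        else:
            clusters.append([sig])
    return clusters

def count_unique(signatures, threshold=0.6):
    """The person count is simply the number of clusters."""
    return len(cluster_faces(signatures, threshold))
```

Multiple captures of the same person collapse into one cluster, so repeated appearances no longer inflate the count.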

The following image shows a two-dimensional visualization of the OpenFace vectors generated for the faces extracted from the sample video:

Each point represents one face and the color represents the cluster it was assigned to.

Using a clustering algorithm, we organize the faces into clusters based on how similar they are. Because we can expect one cluster per unique person, the number of clusters is the number of unique people.



Algorithm II

We combined Algorithm I with the deduplication stage described above to produce the final algorithm:

  1. Run Algorithm I to collect all new face images appearing in the video.
  2. Cluster the face images using the OpenFace similarity measure.
  3. Set the person count to the number of clusters.


After incorporating the clustering stage, the output for the sample video is 9 individuals, much closer to the correct total of 8. The final algorithm is thus far more accurate than Algorithm I, which did not use clustering and counted 13 people in the sample video. What's more, we accomplished Algorithm II using only readily available models, without any training. Moving forward, accuracy could be further improved by training (or fine-tuning) models for the specific context of the application.