Human Activity Recognition using Transfer Learning and Sequential Models
Domain Background:
Video Recognition: Intuitively, a video is nothing but a running sequence of frames, i.e., multiple images stacked one after another. An image (technically, a digital image) is a 2-d array of pixels, where each pixel stores a gray level or intensity value that makes up the image. If we are dealing with color images, there are three of these 2-d arrays, one for each of the three channels: Red (R), Green (G) and Blue (B).
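To make this concrete, here is a minimal sketch (assuming OpenCV is available; "sample.avi" is a placeholder file name) that reads a single frame and inspects its pixel array:

```python
import cv2

cap = cv2.VideoCapture("sample.avi")  # open the video file
ok, frame = cap.read()                # grab the first frame
cap.release()

if ok:
    # A color frame is an H x W x 3 array of pixels, one 2-d array per
    # channel (OpenCV orders the channels B, G, R rather than R, G, B).
    print(frame.shape)  # e.g., (120, 160, 3)
    print(frame.dtype)  # uint8: intensity values in [0, 255]
```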
Transfer Learning: In simple terms, transfer learning is the process of taking a model trained on a large-scale dataset and reusing it to learn a related downstream task.
Strategies employed for transfer learning:
Broadly, there are three strategies: (1) retrain the entire pretrained model on the new data, (2) fine-tune some of the later layers while keeping the earlier ones frozen, or (3) freeze the whole pretrained network and use it purely as a fixed feature extractor. I have used the third strategy to train my model, as sketched below.
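A minimal sketch of that third strategy in Keras (assuming InceptionV3 with ImageNet weights as the base, as used later in this post; this is not the exact training code):

```python
from tensorflow.keras.applications import InceptionV3

# Load weights learned on ImageNet, drop the classification head, and keep a
# global-average pool so the network outputs a 2048-d feature vector per image.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze every layer: the base is a fixed feature extractor
```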
Dataset: The video database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). The videos were captured at a frame rate of 25 fps and each frame was down-sampled to a resolution of 160x120 pixels. Link to dataset: http://www.nada.kth.se/cvap/actions/
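As an aside, the class label of each clip can be derived directly from its file name. This sketch assumes the usual KTH naming scheme (e.g., person01_boxing_d1_uncomp.avi), which should be verified against the downloaded files:

```python
import os

ACTIONS = ["walking", "jogging", "running", "boxing", "handwaving", "handclapping"]

def label_from_filename(path):
    name = os.path.basename(path)  # e.g., "person01_boxing_d1_uncomp.avi"
    action = name.split("_")[1]    # the second underscore-separated field
    return ACTIONS.index(action)   # integer class label in 0..5

print(label_from_filename("person01_boxing_d1_uncomp.avi"))  # 3
```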
Model Architecture:
A pretrained InceptionV3 network serves as a per-frame feature extractor, and the resulting sequences of feature vectors are fed to a recurrent network that outputs the action class.
Methodology:
- Reading in the video, frame by frame.
- The videos were captured at a frame rate of 25 fps, i.e., 25 frames for every second of video. Within a single second a human body does not move very significantly, which implies that most of the frames (per second) in a video are redundant. Therefore only a subset of all the frames needs to be extracted. This also reduces the size of the input data, which helps the model train faster and can help prevent over-fitting.
- Different strategies can be used for frame extraction: (a) extracting a fixed number of frames from the total frames in the video, say only the first 200 frames (i.e., the first 8 seconds of the video); or (b) extracting a fixed number of frames each second, say 5 frames per second from a 10-second video, which returns 50 frames in total. The second approach is better in the sense that frames are extracted sparsely and uniformly from the entire video (see the sketch after this list).
- Each frame needs to have the same spatial dimensions (height and width). Hence each frame in a video will have to be resized to the required size.
- To simplify the computations, the frames can be converted to grayscale. (Note that a pretrained network such as Inception expects 3-channel input, so a grayscale frame would have to be replicated across the three channels before feature extraction.)
- Normalization: the pixel values range from 0 to 255 and have to be normalized to get better performance from the network. Different normalization techniques can be applied, such as min-max scaling to [0, 1] (dividing by 255), scaling to [-1, 1] (as Inception's own preprocess_input does), or standardizing with the mean and standard deviation.
- Make a prediction on each frame using the selected pretrained model (I used the Inception model pretrained on the ImageNet dataset). Since the transfer-learning approach is used, we do not take the final classification; instead we extract the output of the last pooling layer, a feature vector of length 2048 (visible as the final pooling entry in inception.summary()).
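Putting the steps above together, the sketch below (assuming OpenCV and TensorFlow/Keras; the sampling rate, input size and function name are illustrative choices, not the exact implementation) samples frames uniformly, resizes and normalizes them, and extracts one 2048-d Inception feature vector per frame. RGB frames are kept here because Inception expects 3-channel input:

```python
import cv2
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Frozen Inception base: outputs a 2048-d vector per image (global average pool).
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def video_to_features(path, frames_per_second=5, size=(299, 299)):
    """Sample frames uniformly, resize and normalize them, and run them
    through the Inception base. Returns an (n_frames, 2048) array."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25           # KTH videos are 25 fps
    step = max(int(round(fps / frames_per_second)), 1)

    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:                           # keep every `step`-th frame
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = cv2.resize(frame, size)         # same spatial dims for all frames
            frames.append(frame)
        i += 1
    cap.release()

    batch = preprocess_input(np.array(frames, dtype="float32"))  # scale to [-1, 1]
    return extractor.predict(batch, verbose=0)      # one 2048-d vector per frame
```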
Until now we had the feature vector of a single frame. To classify a segment of the video rather than a single frame, we take a group of frames and stack their feature vectors (one per frame of a given video) into a single tensor. This stacked tensor is the input to our second neural network, the recurrent one, which produces the final classification of our system.
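A minimal sketch of that second, recurrent network in Keras (the LSTM size, dropout rate and sequence length here are assumptions, not the exact architecture):

```python
from tensorflow.keras import layers, models

MAX_FRAMES = 50   # e.g., 5 frames/second sampled from a 10-second clip
NUM_CLASSES = 6   # the six KTH actions

rnn = models.Sequential([
    # Input: one 2048-d Inception feature vector per sampled frame; shorter
    # clips are zero-padded to MAX_FRAMES, and Masking skips the padding.
    layers.Masking(mask_value=0.0, input_shape=(MAX_FRAMES, 2048)),
    layers.LSTM(256),      # summarize the whole frame sequence into one vector
    layers.Dropout(0.5),   # regularization
    layers.Dense(NUM_CLASSES, activation="softmax"),  # action probabilities
])
rnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```

Each training example is then the stacked (MAX_FRAMES, 2048) tensor produced from one video, with the action class as its label.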
Results:
Accuracy: 88.37%