Application of Deep Learning using Convolutional Neural Network (CNN) Algorithm for Gesture Recognition

Gesture recognition is a compelling method of human-computer interaction that goes beyond traditional input devices such as keyboards, pointers, and joypads. In gesture recognition, Convolutional Neural Network (CNN) algorithms are used within Deep Learning to train models on datasets of gesture images. The training process involves pattern recognition and the identification of salient features in the gesture images, followed by an evaluation phase that measures the model's accuracy. Gesture recognition holds considerable potential across fields including human-computer interaction, gaming, healthcare, and autonomous vehicles, and remains an active focus of research and development.


INTRODUCTION
With the advancement of technology, increasingly many options are available for interacting with computers (Setiawan, 2018). Gesture recognition is one such alternative (Ridwang, 2018). Gestures can be applied in a variety of applications, such as command systems, robotics, gaming, and sign language. Machine learning is one application of artificial intelligence (Khan et al., 2012). The use of machine learning in computer vision is closely related to deep learning, in which computer scientists draw inspiration from the natural world (Arifin et al., 2021).
Machine learning can be categorized into three main categories: supervised learning, unsupervised learning, and reinforcement learning (Dasgupta and Nath, 2016). In supervised learning, models are trained using labeled data, where each input is accompanied by a corresponding target label (Yan and Wang, 2022). Unsupervised learning, on the other hand, aims to find patterns and structures in unlabeled data without predefined outputs (Ando et al., 2005). Reinforcement learning involves an agent learning to make decisions based on rewards obtained from interacting with an environment (Abdulhai et al., 2003). These three categories provide a comprehensive framework for solving a wide range of problems and have contributed to significant advancements in artificial intelligence (Roihan et al., 2020).
Deep Learning is a branch of machine learning that uses artificial neural networks (ANN) as its foundation (Wahyuni and Sulaeman, 2020). Artificial neural networks are structures commonly used for classification tasks (Ju et al., 2018). In this mechanism, the object to be classified is presented to the network through the activation of artificial neurons in the input layer (Choldun and Surendo, 2018). The Convolutional Neural Network (CNN) is widely used for image classification, object recognition, and detection tasks (Aamir et al., 2018). A CNN consists of three main layer types: convolution, pooling, and classification (Hu et al., 2015). This research builds on a real-time hand gesture recognition system based on OpenCV that employs the histogram of oriented gradients (HOG) and Haar Cascade classifier algorithms to classify various hand shapes (Rijanandi et al., 2023).
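To make the convolution layer concrete, the following minimal sketch implements a single "valid" 2-D convolution in plain NumPy (the toy image and vertical-edge kernel are illustrative assumptions, not data from this study):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image ('valid' padding, stride 1)
    and return the resulting feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy 5x5 patch with a vertical edge, and a 3x3 vertical-edge kernel
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
```

The kernel responds strongly where the pixel intensity changes from left to right, which is the sense in which convolution layers "extract features" such as edges from gesture images.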

RESEARCH METHOD
The research method employed in this study is quantitative. The accuracy of training and validation is calculated using the CNN algorithm for the gesture recognition system (Kurniawan and Mustikasari, 2021).

Data gathering
Data collection for gesture recognition is conducted using the Leap Motion camera, as shown in Figure 3, with a resolution of 240x640 pixels.

Feature Extraction
Feature extraction is performed by creating a CNN model consisting of two main parts: feature extraction and classification (Al-Doori et al., 2021). The feature extraction part comprises convolutional layers and pooling layers, as shown in Figure 2.
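The pooling step of the feature extraction part can be sketched as a 2x2 max-pooling pass in NumPy; the feature map below is a synthetic stand-in at the paper's 240x640 input resolution, not actual Leap Motion data:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Max pooling with stride equal to the window size: keeps the
    strongest activation in each window, halving both dimensions."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # drop any ragged edge
    fm = feature_map[:h, :w]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Synthetic feature map at the study's 240x640 resolution
fm = np.arange(240 * 640, dtype=float).reshape(240, 640)
pooled = max_pool2d(fm)
print(pooled.shape)  # (120, 320)
```

Pooling discards precise pixel positions while retaining the strongest responses, which reduces computation in later layers and makes the extracted features more tolerant of small shifts in hand position.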

Classification
Classification consists of two main layers, namely the flatten layer and the dense layer, which produce the output of the prediction model, as shown in Figure 2. The model is subsequently tested (Kaliyar et al., 2021).

Gesture Recognized
Gestures are recognized after testing the model, whose training and validation accuracy have been evaluated.

RESULTS AND DISCUSSION
The data obtained from the Leap Motion device consists of grayscale images with a resolution of 240x640 pixels, as shown in Figure 3. The dataset contains a total of 6000 images. In the next step, the data is trained to obtain the desired model. During the training phase, the image data is stored in an array and undergoes feature extraction using convolutional layers and pooling layers (Mesut et al., 2020). The extracted features are then used for classification. In the classification stage, i.e. in the fully-connected layer, the desired classification results are obtained and used for gesture recognition (Barbhuiya et al., 2021).
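The data-preparation step described above can be sketched as follows. This is a hedged illustration: the arrays are random stand-ins for the 6000 Leap Motion captures (scaled down to 60 images so the sketch runs quickly), and the 80/20 train/validation split is an assumed convention, not a ratio stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the grayscale 240x640 gesture images; real data would
# be loaded from the Leap Motion captures.
n_images, n_classes = 60, 5
images = rng.integers(0, 256, (n_images, 240, 640)).astype(np.float32)
labels = rng.integers(0, n_classes, n_images)

# Normalize pixel values to [0, 1] before training
images /= 255.0

# Shuffle, then apply an assumed 80/20 train/validation split
idx = rng.permutation(n_images)
split = int(0.8 * n_images)
train_x, val_x = images[idx[:split]], images[idx[split:]]
train_y, val_y = labels[idx[:split]], labels[idx[split:]]
print(train_x.shape, val_x.shape)
```

Normalizing the pixel values and holding out a validation set are what make the training/validation accuracy figures reported for the CNN comparable across epochs.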