• Corpus ID: 207982899

Mobile Object Detection using TensorFlow Lite and Transfer Learning

  • Oscar Alsing
  • Published 2018
  • Computer Science, Engineering


Title: Sign Language Recognition System Using TensorFlow Object Detection API

Abstract: Communication is defined as the act of sharing or exchanging information, ideas, or feelings. To establish communication between two people, both must have knowledge and understanding of a common language. In the case of deaf and mute people, however, the means of communication are different: deafness is the inability to hear, muteness the inability to speak, and they communicate using sign language among themselves and with others. Not everyone possesses knowledge and understanding of sign language, and hearing people often underestimate its importance, which makes communication between a hearing person and a deaf or mute person difficult. To overcome this barrier, one can build a model based on machine learning, trained to recognize different gestures of sign language and translate them into English. This will help many people in communicating and conversing with deaf and mute people. Existing Indian Sign Language recognition systems are designed using machine learning algorithms with single- and double-handed gestures, but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then, using transfer learning, train a TensorFlow model to create a real-time sign language recognition system. The system achieves a good level of accuracy even with a limited-size dataset.
Comments: 14 pages, 5 figures, ANTIC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
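The dataset-collection step described in the abstract (capturing gesture images with a webcam, to be annotated and used for transfer learning with the TensorFlow Object Detection API) might look roughly like the sketch below. The labels, image counts, and paths are illustrative placeholders, not the paper's actual values:

```python
import os
import time
import cv2  # OpenCV

LABELS = ["hello", "thanks", "yes", "no"]  # hypothetical gesture classes
IMAGES_PER_LABEL = 15

cap = cv2.VideoCapture(0)                  # default webcam
for label in LABELS:
    os.makedirs(os.path.join("dataset", label), exist_ok=True)
    print(f"Get ready to show '{label}'...")
    time.sleep(3)                          # time to position the gesture
    for i in range(IMAGES_PER_LABEL):
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join("dataset", label, f"{label}_{i}.jpg"), frame)
        time.sleep(1)                      # brief pause between captures
cap.release()
```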


Object Detection

This Colab demonstrates use of a TF-Hub module trained to perform object detection.

Imports and function definitions


Example use

Helper functions for downloading images and for visualization.

Visualization code adapted from TF object detection API for the simplest required functionality.

Apply module

Load a public image from Open Images v4, save locally, and display.


Pick an object detection module and apply it to the downloaded image (a condensed sketch of this step follows the list). Modules:

  • FasterRCNN+InceptionResNet V2: high accuracy.
  • SSD+MobileNet V2: small and fast.
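The Colab's hidden cells boil down to the following sketch: load one of the two published module handles, convert the image to a float32 batch in [0, 1], and read the detection outputs. The handle strings and output keys match the published TF-Hub modules, but the score threshold and file name are illustrative:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from PIL import Image

# The two detector choices listed above, as published on TF-Hub.
FASTER_RCNN = "https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1"
SSD_MOBILENET = "https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1"

detector = hub.load(SSD_MOBILENET).signatures["default"]

img = np.array(Image.open("downloaded_image.jpg").convert("RGB"))
# The module expects a float32 image batch with values in [0, 1].
inp = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
result = {k: v.numpy() for k, v in detector(inp).items()}

# Print detections above an (illustrative) confidence threshold.
for box, entity, score in zip(result["detection_boxes"],
                              result["detection_class_entities"],
                              result["detection_scores"]):
    if score >= 0.3:
        print(f"{entity.decode('ascii')}: {score:.2f}, box={box}")  # [ymin, xmin, ymax, xmax]
```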


More images

Perform inference on some additional images with time tracking.
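A minimal timing wrapper for this step might look as follows, reusing the detector and input-conversion names from the previous sketch:

```python
import time

def run_detector_timed(detector, inp):
    """Run one inference pass and report wall-clock latency."""
    start = time.time()
    result = detector(inp)
    print(f"Inference time: {time.time() - start:.2f} s")
    return result
```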



  • Open access
  • Published: 01 August 2024

MPE-YOLO: enhanced small target detection in aerial imaging

  • Yichang Qin 1 ,
  • Ze Jia 1 &
  • Ben Liang 1  

Scientific Reports volume 14, Article number: 17799 (2024)


  • Aerospace engineering
  • Electrical and electronic engineering

Aerial image target detection is essential for urban planning, traffic monitoring, and disaster assessment. However, existing detection algorithms struggle with small target recognition and accuracy in complex environments. To address this issue, this paper proposes an improved model based on YOLOv8, named MPE-YOLO. First, a multilevel feature integrator (MFI) module is employed to enhance the representation of small target features and mitigate information loss during the feature fusion process. For the backbone network, a perception enhancement convolution (PEC) module is introduced to replace traditional convolutional layers, expanding the network's fine-grained feature processing capability. Furthermore, an enhanced scope-C2f (ES-C2f) module is designed, utilizing channel expansion and stacking of multiscale convolutional kernels to enhance the network's ability to capture small target details. In experiments on the VisDrone, RSOD, and AI-TOD datasets, the model not only demonstrated superior performance in aerial image detection tasks compared to existing advanced algorithms but also achieved a lightweight model structure. The experimental results demonstrate the potential of MPE-YOLO in enhancing the accuracy and operational efficiency of aerial target detection. Code will be available online (https://github.com/zhanderen/MPE-YOLO).


Introduction

Aerial images, acquired through aerial photography technology, feature high-resolution and extensive area coverage, providing critical support to fields such as traffic monitoring 1 and disaster relief 2 through the automated extraction and analysis of geographic information. With continuous advancements in remote sensing technology, aerial image detection offers valuable data support for geographic information systems and related applications, playing a significant role in enhancing the identification and monitoring of surface objects and the development of geographic information technology.

Aerial images are characterized by complex terrain, varying lighting conditions, and difficulties in data acquisition and storage. Moreover, the high dimensionality and massive volume of aerial image data pose numerous challenges for image detection, particularly because aerial images often contain small targets, which makes detection even more challenging 3 . Given these issues, target detection algorithms are increasingly vital as the core technology for aerial image analysis.

Traditional object detection algorithms often rely on manually designed feature extraction methods such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF). These methods represent targets by extracting local features from images but may fail to capture higher-level semantic information. Machine learning approaches such as support vector machines (SVMs) 4 and random forests 5 have improved the accuracy and efficiency of aerial detection but struggle with complex backgrounds. With the rapid development of deep learning technology, neural network-based image object detection methods have become mainstream. The end-to-end learning capability of deep learning allows algorithms to automatically learn and extract more abstract, higher-level semantic features, replacing traditionally hand-designed features.

Deep learning-based object detection algorithms can be divided into single-stage and two-stage algorithms. Two-stage algorithms are represented by the R-CNN 6 , 7 , 8 series, which adopts a two-stage detection process: first, candidate regions are generated via a region proposal network (RPN); then, location and classification are refined through classifiers and regressors. Such algorithms can precisely locate and identify various complex land objects, especially small or densely arranged targets, and have received widespread attention and application. However, two-stage detection algorithms still have room for improvement in speed and efficiency. Single-stage detection algorithms, represented by the SSD 9 and YOLO 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 series, treat object detection as a regression problem and predict the categories and locations of targets directly from the global image, enabling real-time detection. These algorithms offer good real-time performance and accuracy and are particularly suitable for processing large-scale aerial image data. They hold significant application prospects for quickly obtaining geographic information, monitoring urban changes, and responding to natural disasters. However, single-stage object detection algorithms still face challenges in the accurate detection and positioning of small targets.

In the context of UAV aerial imagery, object detection encounters several specific challenges:

Dense small objects and occlusion. Images captured from low altitudes often contain a large number of dense small objects, particularly in urban or complex terrains. Because of the considerable distance, these objects appear small in the images and are prone to occlusion: buildings might obscure each other, or trees might cover parked vehicles. Such occlusion partially hides target object features, affecting the performance of detection algorithms. Even advanced detection algorithms struggle to accurately identify and locate all objects in highly dense and severely occluded environments.

Real-time requirements vs. accuracy trade-off. UAV aerial image object detection must meet real-time requirements, particularly in monitoring and emergency response scenarios. Achieving real-time detection necessitates reducing algorithmic computational complexity, which frequently conflicts with detection accuracy: high-accuracy detection algorithms typically require substantial computational resources and time, whereas real-time demands require algorithms that can process vast amounts of data swiftly. The challenge lies in maintaining high detection accuracy while ensuring real-time performance, which requires optimizing the network architecture to balance parameter count and accuracy effectively.

Complex backgrounds. Aerial images often include a significant amount of irrelevant background information such as buildings, trees, and roads. The complexity and diversity of this background information can interfere with the correct detection of small objects. Moreover, the features of small objects are inherently less pronounced; traditional single-stage and two-stage algorithms primarily focus on global features and may overlook the fine-grained features crucial for detecting small objects, often failing to capture their details and yielding lower detection accuracy. There is therefore a pressing need for deep learning models and algorithms that can handle these subtle features, enhancing the accuracy of small object detection.

To address the aforementioned issues, this study proposes MPE-YOLO, an algorithm based on the YOLOv8 model that enhances the detection accuracy of small objects while maintaining a lightweight model. The main contributions of this study are as follows.

We developed a multilevel feature integrator (MFI) module with a hierarchical structure to merge image features at different levels, enhancing scene comprehension and boosting object detection accuracy.

A perception enhancement convolution (PEC) module is proposed, which uses multislice operations and channel dimension concatenation to expand the receptive field, thereby improving the model’s ability to capture detailed target information.

By incorporating the proposed enhanced scope-C2f (ES-C2f) operation and introducing an efficient feature selection and utilization mechanism, the selective use of features is further enhanced, effectively improving the accuracy and robustness of small object detection.

After comprehensive comparative experiments with various other object detection models, MPE-YOLO has demonstrated a significant improvement in performance, proving its effectiveness.

The rest of this paper includes the following content: Section 2 briefly introduces the recent research results on aerial image detection and the main idea of YOLOv8. Section 3 introduces the innovations of this paper. Section 4 describes the experimental setup, including the experimental environment, parameter configuration, datasets used, and performance evaluation metrics, and presents detailed experimental steps and results, verifying the effectiveness of the improvement strategies. Section 5 summarizes the main contributions of this research and discusses future directions of work.

Background and related works

Related works

Deep learning-based object detection algorithms are widely applied in fields such as aerial image detection, medical image processing, precision agriculture, and robotics due to their high detection accuracy and inference speed. The following are some algorithms used in aerial image detection: Cheng et al. 18 proposed a method combining cross-scale feature fusion to enhance the network's ability to distinguish similar objects in aerial images. Guo et al. 19 presented a novel object detection algorithm that improves the accuracy and efficiency of highway intrusion detection by refining feature extraction, feature fusion, and computational complexity methods. Sahin et al. 20 introduced YOLODrone, an improved version of the YOLOv3 algorithm that increases the number of detection layers to enhance the model's capability to detect objects of various sizes, although this adds to the model's complexity. Chen et al. 21 enhanced the feature extraction capability of the model by optimizing residual blocks in the multi-level local structure of DW-YOLO and improved accuracy by increasing the number of convolution kernels. Zhu et al. 22 incorporated the CBAM attention mechanism into the YOLOv5 model to address the issue of blurred objects in aerial images. Additionally, Yang 23 enhanced small object detection capability by adding upsampling in the neck part of the YOLOv5 network and integrated an image segmentation layer into the detection network. Lin et al. 24 proposed GDRS-YOLO, which first constructs multi-scale features through deformable convolution and gathering-dispersing mechanisms, and then introduces the normalized Wasserstein distance for mixed loss training, effectively improving the accuracy of object detection in remote sensing images. Jin et al. 25 improved the robustness and generalization of UAV image detection under different shooting conditions by decomposing domain-invariant and domain-specific features and using balanced-sampling data augmentation techniques. Bai et al.'s CCNet 26 suppresses interference in deep feature maps using high-level RGB feature maps while achieving cross-modality interaction, enhancing salient object detection.

In the field of medical image processing, typical object detection algorithms include the following: Pacal et al. 27 demonstrated that by improving the YOLO algorithm and using the latest data augmentation and transfer learning techniques, the efficiency and accuracy of polyp detection could be significantly enhanced. Xu et al. 28 showed that an improved Faster R-CNN model exhibited excellent performance in lung nodule detection, particularly in small object detection capability and overall detection accuracy. Xi et al. 29 improved the sensitivity of small object detection by introducing a super-resolution reconstruction branch and an attention fusion module in the MSP-YOLO network. In the agricultural field, Zhu et al. 30 demonstrated how to achieve high-precision drone control systems through a combination of hardware and software; their application to agricultural spraying provides a reference for the performance of automated control systems in practice. In the field of robotics, Wang et al. 31 researched robotic mechanical models and optimized jumping behavior through bionic methods. This combination of biological observation and mechanical modeling can inspire the development of other robots or systems that require motion optimization, using bionic mechanisms to achieve efficient and reliable motion control.

The aforementioned methods face challenges such as the limitations of the receptive field and insufficient feature fusion in highly complex backgrounds or dense small object scenes, resulting in poor performance in low-resolution and densely occluded situations. Driven by these motivations, we propose an algorithm called MPE-YOLO that improves the detection accuracy of small objects while maintaining a lightweight model. Numerous experiments have demonstrated that by integrating multilevel features and strengthening detail information perception modules, we can achieve higher detection accuracy across different datasets.

Figure 1. YOLOv8 network structure.

YOLOv8 is the latest generation of object detection algorithms developed by Ultralytics, officially released on January 10, 2023. YOLOv8 improves upon YOLOv5 by replacing the C3 module with the C2f module. The head utilizes a contemporary decoupled structure, separating the classification and detection heads, and transitions from an anchor-based to an anchor-free approach, resulting in higher detection accuracy and speed. The YOLOv8 model comprises an input layer, a backbone network, a neck network, and a head network, as shown in Fig. 1. The input image is first resized to 640×640 to meet the size requirements of the input layer, and the backbone network performs downsampling and feature extraction via multiple convolutional operations, with each convolutional layer equipped with batch normalization and SiLU 32 activation functions. To improve the network's gradient flow and feature extraction capacity, the C2f block was introduced, drawing on the E-ELAN structure from YOLOv7 and employing multilayer branch connections. Furthermore, the SPPF 33 block is positioned at the end of the backbone network and combines multiscale feature processing to enhance feature abstraction. The neck network adopts the FPN 34 and PAN 35 structures for effective fusion of feature maps at different scales, which are then passed on to the head network. The head network is designed in a decoupled manner, with two parallel convolutional branches handling regression and classification separately to improve focus and performance on each task. The YOLOv8 series offers five model scales: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Compared to the other models, YOLOv8s strikes a balance between accuracy and model complexity; therefore, this study chooses YOLOv8s as the baseline network.
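For reference, the Conv-BN-SiLU building block described above can be written in a few lines of PyTorch. This is a generic sketch of the pattern, not the Ultralytics source:

```python
import torch.nn as nn

def conv_bn_silu(c_in: int, c_out: int, k: int = 3, s: int = 1) -> nn.Module:
    """YOLOv8-style convolution block: Conv2d -> BatchNorm -> SiLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),  # bias folded into BN
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

# A stride-2 call halves spatial resolution while (typically) doubling channels,
# which is how the backbone performs its successive downsampling steps:
down = conv_bn_silu(64, 128, k=3, s=2)
```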

Methodology

Figure 2. MPE-YOLO network structure.

In response to the need for detecting small objects in aerial and drone imagery, we propose the MPE-YOLO algorithm, which adjusts the structure of the original YOLOv8 components. As shown in Fig. 2, the multilevel feature integrator (MFI) module optimizes the representation and fusion of small target features, reducing information loss during feature fusion. The perception enhancement convolution (PEC) module replaces the traditional convolutional layer, expanding the network's fine-grained feature processing capability and significantly improving the recognition accuracy of small targets in complex backgrounds. We replaced the last two downsampling layers and the detection layer for 20×20 targets in the backbone network with a detection layer for small 160×160 targets, enabling the model to focus more on the details of small targets. Finally, the enhanced scope-C2f (ES-C2f) module further improves feature extraction and operational efficiency through channel expansion and the stacking of multiscale convolution kernels. Combining these improvements, MPE-YOLO performs well in small object detection tasks in complex environments and significantly improves the accuracy and performance of the model. To differentiate from the baseline model, Fig. 2 marks the improved modules with darker colors; the gray area at the bottom represents the removal of the 20×20 detection head, while the yellow area at the top represents the addition of the 160×160 detection head.

Multilevel feature integrator

In object detection tasks, the feature representation of small objects is often unclear due to size restrictions, which can lead to them being overlooked or lost in the feature fusion process, resulting in decreased detection performance. To effectively address this issue, we adopted the structure of Res2Net 36 and designed an innovative multilevel feature integrator (MFI). The structure of the MFI module, as shown in Fig.  3 , aims to optimize the feature representation and information fusion of small objects through a series of detailed strategies, reducing the loss of feature information and suppressing redundancy and noise.

Figure 3. Multilevel feature integrator structure.

First, the MFI module uses convolutional operations to reduce the channel dimension of the input feature maps, simplifying subsequent computation. Next, the reduced feature maps are uniformly divided into four groups (Group 1 to Group 4), each containing 25% of the original feature maps. This partition is not random but a uniform split along the channel dimension, intended to optimize computational efficiency and the subsequent feature fusion. We use a squeeze convolution layer to shape and compress the feature maps from all groups, producing output Out1, which focuses on key target features, reduces feature redundancy, and preserves details helpful for small object detection. Second, by performing proportional feature fusion of Group 1 and Group 2, we construct complex low-level feature representations, forming output Out2 and enhancing the feature details of small objects. Additionally, the bottleneck module 17 is applied to Group 3 to refine high-level semantic information and produce Out3. This high-level feature output helps capture richer contextual information, improving the detection of small objects.

Out4 is obtained by fusing the high-level features from Out3 with the Group 4 features and then processing them again through the bottleneck module. The purpose of this step is to integrate low-level features with high-level features, enabling the model to understand the characteristics of small objects more comprehensively. The four outputs (Out1, Out2, Out3, and Out4) are then concatenated and integrated along the channel direction, so that features at all scales are fully utilized, improving the overall performance of the model in small object detection tasks.

Ultimately, the MFI module adopts a channel-wise feature integration approach to aggregate features from various levels, enhancing the ability to recognize different target behaviors and, in particular, improving the accuracy of capturing small object behaviors and interactions in dynamic scenes.
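To make the data flow concrete, here is one hypothetical PyTorch reading of the MFI description above. The reduction ratio, the residual form of the bottleneck, and the 1×1 fusion convolutions are assumptions where the text leaves details open; treat this as a sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual two-conv block, standing in for YOLOv8's bottleneck."""
    def __init__(self, c: int):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.SiLU(),
                                  nn.Conv2d(c, c, 3, 1, 1), nn.SiLU())
    def forward(self, x):
        return x + self.body(x)

class MFI(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c = c_in // 2                             # assumed reduction ratio
        g = c // 4                                # four equal channel groups
        self.reduce = nn.Conv2d(c_in, c, 1)       # channel reduction
        self.squeeze = nn.Conv2d(c, g, 1)         # Out1: squeeze all groups
        self.fuse12 = nn.Conv2d(2 * g, g, 1)      # Out2: fuse Groups 1+2
        self.refine3 = Bottleneck(g)              # Out3: refine Group 3
        self.refine4 = Bottleneck(g)              # Out4: Out3 fused with Group 4
        self.proj = nn.Conv2d(4 * g, c_out, 1)    # integrate all levels

    def forward(self, x):
        x = self.reduce(x)
        g1, g2, g3, g4 = torch.chunk(x, 4, dim=1)  # uniform channel split (c divisible by 4)
        out1 = self.squeeze(x)
        out2 = self.fuse12(torch.cat([g1, g2], dim=1))
        out3 = self.refine3(g3)
        out4 = self.refine4(out3 + g4)
        return self.proj(torch.cat([out1, out2, out3, out4], dim=1))
```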

Perception enhancement convolution

Figure 4. Perception enhancement convolution structure.

When dealing with multiscale object detection tasks, traditional convolutional neural networks typically face challenges such as fixed receptive fields 37 , insufficient use of context information, and limited environmental perception. In the detection of small objects in particular, these limitations can significantly suppress model performance. To overcome these issues, we introduce the perception enhancement convolution (PEC) module, shown in Fig. 4, which is designed for the backbone network and intended to replace traditional convolutional layers. The main advantage of PEC is that it introduces a new dimension during the extraction of primary features, which significantly expands the receptive field and more effectively integrates context information, deepening the model's understanding of small objects and their environment.

In detail, the PEC module begins by precisely cutting the input feature map into four smaller feature map blocks, each of which is reduced in size by half in the spatial dimension. This cutting process involves the selection of specific pixels, ensuring that representative information from the top-left, top-right, bottom-left, and bottom-right of the original feature map is captured separately in each channel. Through such a meticulous division of the spatial dimension, the resulting small blocks retain important spatial information while ensuring even coverage of information. Subsequently, these small blocks are concatenated in the channel dimension to form a new feature map, with an increased number of channels but reduced spatial resolution, thus significantly reducing the computational burden while maintaining a large receptive field.

To further enhance feature expressiveness and computational efficiency, a squeeze layer is integrated into the PEC, which reduces model parameters by compressing feature dimensions while ensuring that key features are emphasized even as the model is simplified. For deeper feature extraction, we apply the classic bottleneck structure, which not only refines the hierarchical representation of features but also significantly enhances the model’s sensitivity and cognitive ability for small objects, further boosting the computational efficiency of features.

Overall, through the PEC module, the model is endowed with stronger environmental adaptability and a better understanding of object relations. The innovative design of the PEC enables feature maps to obtain more comprehensive and detailed information on targets and the environment while expanding the receptive field. This is particularly crucial in areas such as traffic monitoring for object classification and behavior prediction, as these areas greatly depend on accurate interpretation of subtle changes and complex scenes.
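The slicing described above closely resembles the space-to-depth "Focus" slicing used in earlier YOLO versions: each of the four sub-maps keeps one pixel from every 2×2 neighborhood. A minimal PyTorch sketch, with the squeeze and bottleneck stages simplified to plain convolutions as an assumption:

```python
import torch
import torch.nn as nn

class PEC(nn.Module):
    """Sketch: slice into four pixel-offset sub-maps, concat, squeeze, refine."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.squeeze = nn.Conv2d(4 * c_in, c_out, 1)   # compress stacked slices
        self.refine = nn.Sequential(                   # bottleneck-style refinement
            nn.Conv2d(c_out, c_out, 3, 1, 1), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, 1, 1), nn.SiLU())

    def forward(self, x):
        # Each slice keeps one pixel from every 2x2 neighborhood, so the four
        # together cover the map evenly at half the spatial resolution.
        tl = x[..., 0::2, 0::2]   # top-left pixels
        tr = x[..., 0::2, 1::2]   # top-right
        bl = x[..., 1::2, 0::2]   # bottom-left
        br = x[..., 1::2, 1::2]   # bottom-right
        x = torch.cat([tl, tr, bl, br], dim=1)  # 4x channels, H/2 x W/2
        return self.refine(self.squeeze(x))
```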

Enhanced Scope-C2f

Figure 5. Enhanced Scope-C2f structure.

In the YOLOv8 model, researchers designed the C2f module 17 to maintain a lightweight network while obtaining richer gradient flow information. However, when dealing with small targets or low-contrast targets in aerial images, this module does not sufficiently express fine features, affecting the detection accuracy of targets with complex scales. To address this issue, this study proposes an improved module called Enhanced Scope-C2f (ES-C2f), as shown in Fig.  5 , which focuses on improving the network’s ability to capture details and feature utilization efficiency, especially in expressing small targets and low-contrast targets.

The ES-C2f module enhances the network’s representation capability for targets by expanding the channel capacity of feature maps, enabling the model to capture more subtle feature variations. This strategy is dedicated to enhancing the network’s sensitivity to small target details and improving the adaptability to low-contrast target environments through a wider range of feature representations.

To expand the channel capacity while considering computational efficiency, the ES-C2f module cleverly integrates a series of squeeze layers. These layers perform intelligent selection and compression of feature channels, not only streamlining feature representations but also preserving the capture of key information. The design of this feature operation fully considers the need to enhance identification capabilities while reducing model complexity and computational load. ES-C2f further employs a strategy of stacking multiscale convolutional kernels as well as combining local and global features. This provides an effective means to integrate features at different levels, enabling the model to make decisions on a richer feature dimension. Deep semantic information is cleverly woven with shallow texture details, enhancing the perception of scale diversity.

An optimized squeeze layer is introduced at the end of the module to further refine the essence of the features and adapt to the needs of subsequent processing layers. This engineering not only enhances the feature representation capacity but also improves the information decoding efficiency of subsequent layers, allowing the model to detect and recognize targets with greater precision. With the improvements made to the original C2f module in the YOLOv8 architecture, the proposed ES-C2f module provides a more effective solution for small targets and low-contrast scenes. The ES-C2f module not only maintains the lightweight structure and response speed of the model in extremely challenging scenarios but also significantly improves the overall recognition ability for complex-scale target detection.
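A rough PyTorch sketch of this design follows. The expansion factor, the specific 3×3/5×5 kernel pair (these kernel sizes are named later in the ablation discussion), and the placement of the squeeze layers are assumptions:

```python
import torch
import torch.nn as nn

class ESC2f(nn.Module):
    """Sketch: expand channels, stack 3x3 and 5x5 responses, squeeze back."""
    def __init__(self, c_in: int, c_out: int, e: float = 2.0):
        super().__init__()
        c = int(c_in * e)                                             # expanded capacity
        self.expand = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.SiLU())
        self.k3 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.SiLU())  # local detail
        self.k5 = nn.Sequential(nn.Conv2d(c, c, 5, 1, 2), nn.SiLU())  # wider context
        self.squeeze = nn.Conv2d(3 * c, c_out, 1)                     # select and compress

    def forward(self, x):
        x = self.expand(x)
        # Concatenate the expanded input with both multiscale responses,
        # mixing shallow texture with broader context before squeezing.
        return self.squeeze(torch.cat([x, self.k3(x), self.k5(x)], dim=1))
```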

Experiments

Experimental setup

The batch size was set to 4 to avoid memory overflow, the learning rate was set to 0.01 and adjusted by a cosine annealing schedule, the momentum of stochastic gradient descent (SGD) was set to 0.937, and the mosaic method was used for data augmentation. The input resolution was uniformly set to 640×640. All models were trained for 200 epochs, and no pretrained models were used, to ensure the fairness of the experiment. We opted for random weight initialization, ensuring that the initial weights of each model originate from the same distribution. Although the specific initial values differ, this guarantees that all models start from a fair and balanced point, enabling comparison under identical training conditions without the influence of historical biases from pretrained models. Pretrained models are typically trained on large datasets that may not align with our target dataset distribution, potentially introducing unforeseen biases; therefore, we decided against using them. To mitigate the impact of randomness in weight initialization, we conducted multiple independent experiments and averaged the results. Table 1 lists the training environment configurations.
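With the public Ultralytics API, a run matching these settings could be approximated as below. The dataset YAML path is hypothetical, and this is a reconstruction of the stated setup rather than the authors' actual script:

```python
from ultralytics import YOLO

# Build YOLOv8s from its architecture YAML => random initialization,
# matching the paper's decision not to use pretrained weights.
model = YOLO("yolov8s.yaml")

model.train(
    data="VisDrone.yaml",   # dataset config (hypothetical path)
    epochs=200,
    batch=4,                # small batch to avoid memory overflow
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    cos_lr=True,            # cosine annealing learning-rate schedule
    mosaic=1.0,             # mosaic data augmentation
    pretrained=False,
)
```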

To ensure the rationality of the experimental data, this article selected three representative public datasets for experiments, namely VisDrone2019 38 , RSOD 39 , and AI-TOD 40 . VisDrone2019, as the main dataset of this experiment, was subjected to very detailed comparative and ablation studies. To validate the generalizability and universality of the model, experiments were conducted on the RSOD and AI-TOD datasets.

Considering dataset consistency and the continuity of the study, we selected the VisDrone2019 dataset, which was collected and released by Tianjin University's Machine Learning and Data Mining Lab and comprises a total of 8629 images. Of these, 6471 images were used for training, 548 for validation, and 1610 for testing. The dataset encompasses 10 categories from daily scenes: pedestrian, person, bicycle, car, van, truck, tricycle, awning tricycle, bus, and motorcycle. The category proportions are unbalanced, and most images contain small targets, making detection difficult.

The RSOD dataset is a public dataset released by Wuhan University in 2017. It consists of 976 optical remote sensing images taken from Google Earth and Tianditu and covers four object classes: aircraft, oil tank, overpass, and playground, totalling 6950 targets. To increase the number of samples, the dataset was expanded by rotation, translation, and splicing, bringing the total to 2000 images. To avoid data leakage, augmentation was performed only on the training set; the validation and test sets remain in their original state. The dataset was then randomly split into training, validation, and test sets at a ratio of 8:1:1, with the training set comprising 1600 images and the validation and test sets containing 200 images each.

The AI-TOD dataset is a specialized remote sensing image dataset focused on tiny objects, consisting of 28,036 images and 700,621 targets. These targets are divided into eight categories: bridge, ship, vehicle, storage-tank, person, swimming-pool, wind-mill, and airplane. Compared to other aerial remote sensing datasets, the average size of targets in AI-TOD is approximately 12.8 pixels, which is significantly smaller than that in other datasets, increasing the difficulty of detection. The dataset is divided into training, validation, and test sets at a ratio of 6:1:3.

Evaluation criteria

We selected mAP0.5, mAP0.5:0.95, and APs as indicators to measure the model’s accuracy in small target detection. To evaluate the model’s efficiency, we used the number of parameters and model size as indicators of its lightweight nature. Additionally, latency was chosen to assess the model’s real-time detection performance.

Precision is the ratio of the number of samples correctly predicted as positive to the number of all samples predicted as positive:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the ratio of the number of samples correctly predicted as positive to the number of all truly positive samples:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

TP (true positives) represents the number of correctly identified positive instances, FP (false positives) represents the number of incorrectly identified negative instances as positive, and FN (false negatives) represents the number of incorrectly identified positive instances as negative.

mAP is the mean of the AP values over all categories, where the AP of a category is the area under its precision-recall curve; the larger the mAP, the better the model's comprehensive detection performance across all categories:

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

The APs metric is the average precision computed over small objects only; it indicates how well the model performs when detecting small objects. The number of parameters represents how many parameters the model uses, measured in millions, and provides a direct indicator of model complexity: more parameters usually mean greater representational power but can also lead to longer training times and a risk of overfitting. Model size refers to the size of the model file stored on disk, usually quantified in megabytes (MB); it reflects the storage space the model occupies, which is especially important in resource-constrained environments such as mobile devices or embedded deployments. Latency refers to the time taken to process one frame in object detection and is one of the metrics for judging whether a model can meet real-time requirements.
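As a small worked example of the precision and recall formulas (the counts are illustrative, not results from the paper):

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# e.g. 80 correct detections, 10 false alarms, 20 missed targets:
print(precision(80, 10))  # 0.888...
print(recall(80, 20))     # 0.8
```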

Ablation study

To validate the effectiveness of the proposed module in aerial image detection, we conducted ablation studies for each module, using the YOLOv8s model as the baseline. The experimental results are shown in Table  2 , where ✓ indicates the addition of the module to the model, A represents adding the MFI module, B represents improving the network structure, C represents adding the PEC module, and D represents adding the ES-C2f module.

Incorporating the multilevel feature integrator (MFI) module produces a notable enhancement in small object detection performance: a 1.6% increase in mAP0.5 and a 0.9% increase in mAP0.5:0.95. Simultaneously, the total number of model parameters is reduced by 0.8 M and the model size decreases by 1.6 MB. Latency also drops to 8.5 ms, indicating that the MFI module improves the model's computational efficiency and feature extraction capability, particularly in integrating multilevel semantic information and reducing redundant calculation.

By optimizing the network structure, removing redundant deep feature mappings, and introducing a detection head optimized for small objects, the precision of the model is significantly enhanced, as is its ability to capture low-frequency detail information. These changes yielded improvements of 1.8% in mAP0.5 and 1.3% in mAP0.5:0.95. By compressing the number of channels and reducing the number of network layers, the model can abstractly extract semantic information from deeper feature maps, further enhancing the recognition of small objects. The simplification of the structure not only reduced the parameter count by 7.2 M but also reduced the model size to 6.3 MB. However, latency increased to 12 ms, suggesting that the added small object detection head comes at a latency cost.

Subsequently, by introducing the PEC module, the feature maps are finely sliced and fused along the channel dimension, enhancing the spatial integrity and richness of the features. At the same time, with the introduction of squeeze layers, we compress key information while reducing computational complexity, thus improving the efficiency of feature processing. By using the bottleneck structure for deep feature processing, the small object detection and processing capabilities of the module are enhanced, and the complexity of the model increases only slightly compared to that of the baseline model, maintaining the latency at 12.5 ms, resulting in a 1.2% improvement in the mAP0.5 and a 0.7% improvement in the mAP0.5:0.95. This result shows that even with a slight increase in complexity, the PEC module achieves a significant improvement in the accuracy of small object detection, especially in complex scenarios, where the model’s performance has been effectively improved.

Finally, by integrating the ES-C2f module, the model combines the advantages of 3×3 and 5×5 convolutional kernels to capture local detail features of the target more efficiently than the traditional C2f module while integrating a wider range of contextual information. This module not only improves computational efficiency but also enhances the model's representational capacity through internal feature channel transformation and information compression, allowing the model to analyze image content more comprehensively and accurately capture the details of small objects. As a result, mAP0.5 and mAP0.5:0.95 increased by approximately 1.1% and 0.6%, respectively, while the number of parameters and the model size were reduced by 6.7 M and 12.7 MB compared to the baseline; latency rose to 14 ms, which still remains reasonable.

These results validate our improvement strategy, which effectively enhances the accuracy of target detection in aerial images while ensuring that the model is lightweight, demonstrating the profound significance of the research.

Compared with the baseline model, MPE-YOLO shows a significant improvement in the detection accuracy of all categories. As shown in Table  3 , the accuracy of both the pedestrian and people categories is improved by more than 8 points, which indicates that the MPE-YOLO model has a strong detail capture ability for small-scale targets. Overall, the average accuracy of the MPE-YOLO model (mAP0.5) reached 37.0%, which is nearly 6% higher than that of YOLOv8, proving the effectiveness of MPE-YOLO.

Comparative experiments

To validate the effectiveness of the model, we compared MPE-YOLO with popular object detection algorithms, including YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX 41 , RT-DETR 42 , Gold-YOLO 43 , and ASF-YOLO 44 , the latter two being recent research results, as shown in Table 4.

The test results on the VisDrone2019 dataset show clear performance differences among the object detection algorithms. The classical YOLOv5s model reached 26.8% mAP0.5 and 7.0% APs for small target detection, reflecting the challenge that basic YOLO models face with small targets in aerial image datasets. In comparison, YOLOv6s performed slightly worse, with 26.6% mAP0.5 and 6.7% APs; although the two methods perform similarly, their model size and parameter count differ significantly, with YOLOv6s nearly three times larger than YOLOv5s in model size and more than double in parameters. YOLOX-s increased mAP0.5 to 29.5% and APs to 8.8%, a significant improvement in detection, but at the cost of a larger model size (50.4 MB) and more parameters (8.9 M).

We then analyzed the more advanced YOLOv8s and YOLOv8m models. YOLOv8s achieves 31.3% mAP0.5 and 8.2% APs, indicating that structural optimization has led to significant improvements. YOLOv8m achieves 35.4% mAP0.5 and 9.8% APs, which further indicates that larger models can achieve better accuracy, especially for the more complex task of small object detection.

The RT-DETR-R18 model scores highly on both mAP0.5 (35.9%) and APs (10.2%) compared with the traditional YOLO-series architectures. It uses the DETR architecture, indicating the potential of attention mechanisms for more accurate object detection, and its model size and parameter count are also lower than those of YOLOv8m.

To further validate the superiority of the MPE-YOLO model, we included two advanced models from the recent literature, Gold-YOLO and ASF-YOLO, in the comparative experiments. Gold-YOLO achieved 33.2% mAP0.5 and 9.5% APs with a model size of 26.3 MB and 13.4 M parameters; ASF-YOLO achieved 34.0% mAP0.5 and 9.6% APs with a model size of 22.8 MB and 11.3 M parameters. Both models showed significant improvements in performance and small object detection compared to the early YOLO series.

In the end, the MPE-YOLO model achieved the highest mAP0.5 of 37.0% and APs of 10.8%, while maintaining a model size of only 11.5 MB and 4.4 million parameters. This demonstrates that MPE-YOLO not only outperforms other current models in terms of performance but also achieves low resource consumption through its lightweight design, making it highly practical and attractive for real-world applications.

Visual analytics

Figure 6. Comparison of YOLOv8 (middle) and MPE-YOLO (right) on the VisDrone dataset.

By carefully selecting image samples, we applied the baseline model and the MPE-YOLO model for object detection and compared their detection performance. As shown in Fig. 6, the detection confidence of the MPE-YOLO model is significantly better than that of the baseline model across multiple scenarios and challenging conditions: the target bounding boxes it identifies have higher confidence scores, and these scores are more consistent with the actual targets. More importantly, MPE-YOLO also shows significant improvements in reducing false positives and false negatives, accurately identifying and localizing most targets while minimizing misidentification of non-target areas. Moreover, even under suboptimal shading or lighting conditions, MPE-YOLO achieved a low missed detection rate. These comparisons highlight the effectiveness of the enhanced feature extraction network in MPE-YOLO in dealing with overlap between targets, size variation, and complex backgrounds, indicating more robust feature learning and more accurate target prediction.

Figure 7.

In Fig. 7, the improved MPE-YOLO model demonstrates superior feature extraction and targeting capabilities, evident in its more concentrated and reinforced high-response regions. These appear as brighter areas on the heat map, closely following the actual position and contour of the target and demonstrating that MPE-YOLO can effectively focus on important signals. In addition, compared with the baseline model, the heat map generated by the improved model shows fewer scattered hot spots around the target, which reduces the possibility of false detections and false alarms, demonstrating the precision and robustness of MPE-YOLO in small target detection tasks. First, the heat map of the night scene in the first row reveals the recognition ability of MPE-YOLO under low-light conditions: areas of strong brightness are accurately mapped to the target location, indicating that the model retains efficient feature capture at low lighting levels. Then, in the second row, when faced with a complex background scene, the heat map generated by MPE-YOLO maintains accurate identification of the target without being affected by the complex environment; the model's clear localization of the target verifies its effectiveness in distinguishing the target from a cluttered background. Finally, in the case of the dense small targets in the third row, the MPE-YOLO heat map shows excellent discrimination even when targets are very close to each other: the highlights correspond densely and distinctly to the contours of each small target, showing the model's ability to accurately locate multiple targets.

This visual evidence is consistent with the increases in mAP0.5 and mAP0.5:0.95 observed in the experiments, providing intuitive and strong support for our research.

Figure 8. Relationship between mAP0.5:0.95 and model parameter count for different models.

Figure 8 shows the relationship between mAP0.5:0.95 and the parameter count of each model, where the x-axis represents the parameters of the model and the y-axis represents the detection performance. As can be seen from the figure, MPE-YOLO improves detection accuracy while maintaining a low parameter count. Compared with all the comparison models, our model is best suited for drone vehicle inspection tasks.

Generalization study

Through comprehensive comparative tests on two remote sensing image datasets, RSOD and AI-TOD (Table 5), the MPE-YOLO model demonstrates superior generalizability. In these tests, MPE-YOLO showed higher accuracy on the two key performance indicators, mAP0.5 and mAP0.5:0.95, than several existing advanced object detection models, especially on the AI-TOD dataset, where the average target size is only 12.8 pixels.

The experimental results reveal the strong detection ability of MPE-YOLO, which maintains high accuracy even in small target detection scenarios, confirming its practicability and effectiveness in the field of remote sensing image analysis. These conclusions support the use of the MPE-YOLO model as a remote sensing target detection algorithm with strong adaptability and generalizability, and indicate its broad potential for future practical applications.

Figure 9. Comparison of YOLOv8 (middle) and MPE-YOLO (right) on the RSOD dataset.

Figure 10. Comparison of YOLOv8 (middle) and MPE-YOLO (right) on the AI-TOD dataset.

To demonstrate more clearly the strength of our algorithm in detecting small targets, we selected several representative photographs from the RSOD and AI-TOD datasets. Figures 9 and 10 show that YOLOv8 misses far more small targets than MPE-YOLO, which has significantly fewer missed cases. Additionally, MPE-YOLO shows a general improvement in detection precision. These comparative visuals underscore that MPE-YOLO is the more suitable model for practical detection in aerial imagery applications.

Upon examining these sets of illustrations, it becomes evident that our MPE-YOLO outperforms YOLOv8, especially in scenarios with smaller and easily overlooked targets, reinforcing its efficacy and reliability for deployment in aerial target detection tasks.

Conclusions

In this study, we propose the MPE-YOLO model, which effectively improves the accuracy of small and medium-sized object detection in aerial images and optimizes detection performance in complex environments. First, the MFI module improves the efficiency of feature fusion, reduces information loss, and improves the feature representation of small targets. The PEC module enhances the network's ability to capture detailed target features, which has a significant effect on object detection against complex backgrounds. The ES-C2f module further strengthens the feature representation of small targets by enlarging the effective perception scope. The model has been tested on multiple aerial image datasets, confirming its excellent performance, especially in real-time processing and detection accuracy. Future work will focus on improving the generalization ability of the model and optimizing its operational efficiency, with a view to deployment in a wider range of practical applications.

Data availability

All the images and experimental test images in this paper are from the open-source VisDrone, RSOD, and AI-TOD datasets. The datasets analyzed during the current study can be found at the following websites. VisDrone: (https://github.com/VisDrone/VisDrone-Dataset), RSOD: (https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset-) and AI-TOD: (https://github.com/jwwangchn/AI-TOD).

Liu, H. et al. Improved GBS-YOLOv5 algorithm based on YOLOv5 applied to UAV intelligent traffic. Sci. Rep. 13, 9577 (2023).


Bravo, R. Z. B., Leiras, A. & Cyrino Oliveira, F. L. The use of UAVs in humanitarian relief: An application of POMDP-based methodology for finding victims. Prod. Oper. Manag. 28, 421–440 (2019).


Suthaharan, S. Support vector machine. In Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning 207–235 (Springer, 2016).

Biau, G. & Scornet, E. A random forest guided tour. TEST 25 , 197–227 (2016).


Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) Vol. 1, 886–893 (IEEE, 2005).

Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 580–587 (2014).

Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision 1440–1448 (2015).

Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst. 28 (2015).

Liu, W. et al. SSD: Single shot multibox detector. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, 21–37 (Springer, 2016).

Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 779–788 (2016).

Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6517–6525 (2017).

Redmon, J. & Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).

Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).

Glenn, J. Ultralytics YOLOv5 (2022).

Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).

Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7464–7475 (2023).

Glenn, J. Ultralytics YOLOv8 (2023).

Cheng, G., Si, Y., Hong, H., Yao, X. & Guo, L. Cross-scale feature fusion for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 18 , 431–435 (2020).


Guo, J. et al. A new detection algorithm for alien intrusion on highway. Sci. Rep. 13, 10667 (2023).

Sahin, O. & Ozer, S. YOLODrone: Improved YOLO architecture for object detection in drone images. In 2021 44th International Conference on Telecommunications and Signal Processing (TSP) 361–365 (IEEE, 2021).

Chen, Y., Zheng, W., Zhao, Y., Song, T. H. & Shin, H. DW-YOLO: An efficient object detector for drones and self-driving vehicles. Arab. J. Sci. Eng. 48, 1427–1436 (2023).

Zhu, X., Lyu, S., Wang, X. & Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2778–2788 (2021).

Yang, Y. Drone-view object detection based on the improved YOLOv5. In 2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA) 612–617 (IEEE, 2022).

Lin, Y., Li, J., Shen, S., Wang, H. & Zhou, H. GDRS-YOLO: More efficient multiscale features fusion object detector for remote sensing images. 21, 1–5 (2024).

Jin, R., Jia, Z., Yin, X., Niu, Y. & Qi, Y. Domain feature decomposition for efficient object detection in aerial images. 16, 1626 (2024).

Bai, Z., Liu, Z., Li, G., Ye, L. & Wang, Y. Circular complement network for RGB-D salient object detection. 451, 95–106 (Elsevier, 2021).

Google Scholar  

Pacal, I. et al. In An efficient real-time colonic polyp detection with YOLO algorithms trained by using negative samples and large datasets 141 , 105031 (2022).

Xu, J., Ren, H., Cai, S. & Zhang, X. An Improved faster R-CNN Algorithm for Assisted Detection of Lung Nodules Vol. 153, 106470 (Elsevier, 2023).

Chen, X., Zheng, H., Tang, H. & Li, F. Multi-Scale Perceptual YOLO for Automatic Detection of Clue Cells and Trichomonas in Fluorescence Microscopic Images 108500 (Elsevier, 2024).

Zhu, H. et al. Development of a PWM Precision Spraying Controller for Unmanned Aerial Vehicles Vol. 7, 276–283 (Elsevier, 2010).

Wang, M., Zang, X.-Z., Fan, J.-Z. & Zhao, J. Biological Jumping Mechanism Analysis and Modeling for Frog Robot Vol. 5, 181–188 (Elsevier, 2008).

Nishiyama, T., Kumagai, A., Kamiya, K. & Takahashi, K. Silu: Strategy involving large-scale unlabeled logs for improving malware detector. In 2020 IEEE Symposium on Computers and Communications (ISCC) 1–7 (IEEE, 2020).

He, K., Zhang, X., Ren, S. & Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 , 1904–1916 (2015).

Article   PubMed   Google Scholar  

Lin, T.-Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2117–2125 (2017).

Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8759–8768 (2018).

Gao, S.-H. et al. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43 , 652–662 (2019).

Luo, W., Li, Y., Urtasun, R. & Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inform. Process. Syst. 29 (2016).

Du, D. et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF international conference on computer vision workshops (2019).

Long, Y., Gong, Y., Xiao, Z. & Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 55 , 2486–2498 (2017).

Wang, J., Yang, W., Guo, H., Zhang, R. & Xia, G.-S. Tiny object detection in aerial images. In 2020 25th International Conference on Pattern Recognition (ICPR) 3791–3798 (IEEE, 2021).

Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).

Lv, W. et al. Detrs beat yolos on real-time object detection. arXiv preprint arXiv:2304.08069 (2023).

Wang, C. et al. Gold-yolo: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inform. Process. Syst. 36 (2024).

Kang, M., Ting, C.-M., Ting, F. & Phan, R. Asf-yolo: A novel yolo model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 147 , 105057 (2024).

Download references

Acknowledgements

This work was supported by a grant from the National Natural Science Foundation of China (No. 62105093).

Author information

Authors and Affiliations

College of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, 050018, China

Jia Su, Yichang Qin, Ze Jia & Ben Liang


Contributions

J.S. conceived the experiments, J.S. and Y.Q. conducted the experiments, Z.J. and B.L. analysed the results. Y.Q. wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yichang Qin.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Su, J., Qin, Y., Jia, Z. et al. MPE-YOLO: enhanced small target detection in aerial imaging. Sci. Rep. 14, 17799 (2024). https://doi.org/10.1038/s41598-024-68934-2

Download citation

Received: 29 February 2024

Accepted: 30 July 2024

Published: 01 August 2024

DOI: https://doi.org/10.1038/s41598-024-68934-2


Keywords

  • Object detection
  • Aerial image
  • Small target
  • Model lightweight



Real Time Vessel Detection Model Using Deep Learning Algorithms for Controlling a Barrier System


1. Introduction

  • Camera Calibration: We need to calibrate the camera to obtain the intrinsic parameters (focal length and principal point) and distortion coefficients. This step is crucial for accurate measurements in real-world units.
  • Object Detection: This is the focus of this paper. We detect and track the ship in each frame. The model provides us with bounding box coordinates for the ship.
  • Distance Measurement: We determine the real-world distance between the camera and a reference point on the ship. This information comes from a lidar sensor.
  • Speed Calculation: We calculate the speed of the ship using the change in its position over time. The speed ( v ) can be calculated using the formula v = Δd/Δt, where Δd is the change in distance and Δt is the change in time between consecutive frames.
  • Frame Rate: We consider the frame rate of the camera ( f_frame ) when calculating the time difference between frames: Δt = 1/f_frame. A minimal sketch of this speed estimation follows this list.
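Assuming per-frame distances from the lidar reference point and a known camera frame rate, the speed estimate above reduces to a few lines of code; this is an illustrative sketch, not the authors' implementation.

```python
# Hedged sketch: estimate vessel speed from consecutive per-frame distance
# readings (e.g., from the lidar reference point) and the camera frame rate.

def estimate_speed(distances, f_frame):
    """Return per-frame speed estimates (m/s) from consecutive distances (m)."""
    dt = 1.0 / f_frame                      # time between consecutive frames
    return [(d2 - d1) / dt                  # v = delta_d / delta_t
            for d1, d2 in zip(distances, distances[1:])]

# Example: a 30 fps camera and a ship closing from 120.0 m to 119.2 m.
print(estimate_speed([120.0, 119.8, 119.5, 119.2], f_frame=30))
# -> [-6.0, -9.0, -9.0] (negative values: distance decreasing, ship approaching)
```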

2. Literature Review

  • Architecture search: YOLO-NAS uses neural architecture search to find the most effective architecture for the task of object detection.
  • Efficiency: YOLO-NAS aims to find a network architecture that is both accurate and computationally efficient.

3. Implementation of YOLOv5 and YOLOv8

3.1. Tools and Environment Used

3.2. Dataset Labeling

3.3. Data Training, Validation, and Testing Split

  • Flip—horizontal ( Figure 7 );
  • Grayscale—applied to 15% of images ( Figure 7 );
  • Blur—up to 1.25 px ( Figure 8 );
  • Noise—up to 3% of pixels ( Figure 8 );
  • Shear—±10° horizontal, ±10° vertical ( Figure 9 );
  • Brightness—between −25% and +25% ( Figure 9 ); these transforms are reproduced in the sketch after this list.
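The following is a minimal sketch of the augmentation list above using the Albumentations library; the exact tool and parameter mapping used by the authors (for instance, how "blur up to 1.25 px" translates to a kernel size) are assumptions, not their actual configuration.

```python
# Hedged sketch: an Albumentations pipeline approximating the listed
# augmentations (API as of Albumentations 1.x). Probabilities and ranges
# are illustrative assumptions.
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                                  # flip: horizontal
        A.ToGray(p=0.15),                                         # grayscale on ~15% of images
        A.Blur(blur_limit=3, p=0.5),                              # mild blur (small kernel)
        A.GaussNoise(var_limit=(5.0, 20.0), p=0.5),               # light pixel noise
        A.Affine(shear={"x": (-10, 10), "y": (-10, 10)}, p=0.5),  # ±10° shear
        A.RandomBrightnessContrast(brightness_limit=0.25,
                                   contrast_limit=0.0, p=0.5),    # ±25% brightness
    ],
    # Keep YOLO-format bounding boxes consistent with the transformed image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```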

3.4. Model Training

4. Results and Discussion

  • Architectural Changes: YOLOv8 introduced architectural changes, such as using CSPDarknet53 as the backbone. This architecture impacts the model’s ability to capture features and representations, leading to improved performance. In this work, we are interested in capturing features of the ship’s superstructure that help identify the type of ship. However, a lightweight improved YOLOv5 is proposed in [ 34 ] for real-time localization of fruits, using the bneck module of MobileNetV3 instead of CSPDarknet53; in that study, the modified YOLOv5 reached a mAP of 0.969.
  • Training Techniques: YOLOv8 incorporated training techniques, such as the use of CIOU (complete intersection over union) loss and focal loss. These techniques are aimed at improving the accuracy of object detection during the training process.
  • Scales for Flexibility: YOLOv8 provides different scales (S, M, L, and X). This allows us to choose the scale that best fits our requirements. This adaptability was beneficial for customizing the models to the specific task of detecting ship profiles.
  • Task Adaptability: YOLOv8 is designed to be adaptable to various object detection tasks, including custom applications. The architecture allows us to train models on our specific datasets; a minimal fine-tuning sketch follows this list.
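As a concrete illustration of this adaptability, the sketch below fine-tunes a YOLOv8 model with the Ultralytics API; the dataset file "ships.yaml" and the hyperparameters are placeholders rather than the configuration used in this paper.

```python
# Hedged sketch: fine-tuning YOLOv8 on a custom ship dataset.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")      # choose a scale (n/s/m/l/x) to fit the task
model.train(
    data="ships.yaml",          # hypothetical dataset definition file
    epochs=100,
    imgsz=640,
    batch=16,                   # mini-batch size as in the results table
)
metrics = model.val()           # precision, recall, mAP@50, mAP@50-95
```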

5. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

1. Fang, S.; Mu, L.; Jia, S.; Liu, K.; Liu, D. Research on sunken & submerged oil detection and its behavior process under the action of breaking waves based on YOLO v4 algorithm. Mar. Pollut. Bull. 2022, 179, 113682.
2. Barbero-García, I.; Kuschnerus, M.; Vos, S.; Lindenbergh, R. Automatic detection of bulldozer-induced changes on a sandy beach from video using YOLO algorithm. Int. J. Appl. Earth Obs. Geoinf. 2023, 117, 103185.
3. Zhang, L.; Chen, P.; Li, M.; Chen, L.; Mou, J. A data-driven approach for ship-bridge collision candidate detection in bridge waterway. Ocean Eng. 2022, 266, 113137.
4. Wang, L.; Dong, Y.; Fei, C.; Liu, J.; Fan, S.; Liu, Y.; Li, Y.; Liu, Z.; Zhao, X. A lightweight CNN for multi-source infrared ship detection from unmanned marine vehicles. Heliyon 2024, 10, E26229.
5. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22.
6. Li, B.; Xie, X.; Wei, X.; Tang, W. Ship detection and classification from optical remote sensing images: A survey. Chin. J. Aeronaut. 2021, 34, 145–163.
7. Madjidi, H.; Laroussi, T. Approximate MLE based automatic bilateral censoring CFAR ship detection for complex scenes of log-normal sea clutter in SAR imagery. Digit. Signal Process. Rev. J. 2023, 136, 103972.
8. Lou, X.; Liu, Y.; Xiong, Z.; Wang, H. Generative knowledge transfer for ship detection in SAR images. Comput. Electr. Eng. 2022, 101, 108041.
9. Yu, C.; Shin, Y. SAR ship detection based on improved YOLOv5 and BiFPN. ICT Express 2023, 10, 28–33.
10. Xiong, B.; Sun, Z.; Wang, J.; Leng, X.; Ji, K. A Lightweight Model for Ship Detection and Recognition in Complex-Scene SAR Images. Remote Sens. 2022, 14, 6053.
11. Chen, Z.; Chen, D.; Zhang, Y.; Cheng, X.; Zhang, M.; Wu, C. Deep learning for autonomous ship-oriented small ship detection. Saf. Sci. 2020, 130, 104812.
12. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. In Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2021; pp. 1066–1073.
13. Bakhshi, A.; Chalup, S.; Noman, N. Fast Evolution of CNN Architecture for Image Classification. In Deep Neural Evolution: Deep Learning with Evolutionary Computation; Iba, H., Noman, N., Eds.; Springer: Singapore, 2020; pp. 209–229.
14. Yang, Z.; Nevatia, R. A Multi-Scale Cascade Fully Convolutional Network Face Detector. September 2016. Available online: http://arxiv.org/abs/1609.03536 (accessed on 4 April 2023).
15. Tomè, D.; Monti, F.; Baroffio, L.; Bondi, L.; Tagliasacchi, M.; Tubaro, S. Deep Convolutional Neural Networks for pedestrian detection. Signal Process. Image Commun. 2016, 47, 482–489.
16. Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection. April 2016. Available online: http://arxiv.org/abs/1604.04693 (accessed on 4 April 2023).
17. Wan, J.; Wang, D.; Hoi, S.C.; Wu, P.; Zhu, J.; Zhang, Y.; Li, J. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 157–166.
18. Wu, Z.; Wang, X.; Jiang, Y.-G.; Ye, H.; Xue, X. Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification. April 2015. Available online: http://arxiv.org/abs/1504.01561 (accessed on 15 May 2023).
19. Zhang, Y.; Sohn, K.; Villegas, R.; Pan, G.; Lee, H. Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015.
20. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VII 13; Springer: Cham, Switzerland, 2014.
21. Liu, B.; Zhao, W.; Sun, Q. Study of Object Detection Based on Faster R-CNN. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017.
22. Chen, X.; Gupta, A. An Implementation of Faster RCNN with Study for Region Sampling. February 2017. Available online: http://arxiv.org/abs/1702.02138 (accessed on 24 May 2023).
23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. June 2015. Available online: http://arxiv.org/abs/1506.02640 (accessed on 10 February 2023).
24. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. December 2016. Available online: http://arxiv.org/abs/1612.08242 (accessed on 10 February 2023).
25. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. April 2020. Available online: http://arxiv.org/abs/2004.10934 (accessed on 30 June 2023).
26. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. DenseNet: Implementing Efficient ConvNet Descriptor Pyramids. April 2014. Available online: http://arxiv.org/abs/1404.1869 (accessed on 24 April 2023).
27. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Shen, H.; Ren, J.; Han, S.; Ding, E.; et al. PP-YOLO: An Effective and Efficient Implementation of Object Detector. July 2020. Available online: http://arxiv.org/abs/2007.12099 (accessed on 5 September 2023).
28. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. ultralytics/yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation; Zenodo: Genève, Switzerland, 2022.
29. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. 2021. Available online: https://github.com/ultralytics/yolov3 (accessed on 4 April 2023).
30. Chen, D.; Shen, H.; Shen, Y. PT-NAS: Designing efficient keypoint-based object detectors for desktop CPU platforms. Neurocomputing 2022, 476, 38–52.
31. Chen, J.; Chen, H.; Xu, F.; Lin, M.; Zhang, D.; Zhang, L. Real-time detection of mature table grapes using ESP-YOLO network on embedded platforms. Biosyst. Eng. 2024, 246, 122–134.
32. Liu, X.; Wang, T.; Yang, J.; Tang, C.; Lv, J. MPQ-YOLO: Ultra low mixed-precision quantization of YOLO for edge devices deployment. Neurocomputing 2024, 574, 127210.
33. Escorcia-Gutierrez, J.; Gamarra, M.; Beleño, K.; Soto, C.; Mansour, R.F. Intelligent deep learning-enabled autonomous small ship detection and classification model. Comput. Electr. Eng. 2022, 100, 107871.
34. Zeng, T.; Li, S.; Song, Q.; Zhong, F.; Wei, X. Lightweight tomato real-time detection method based on improved YOLO and mobile deployment. Comput. Electron. Agric. 2023, 205, 107625.


| Hyperparameter | UoM | YOLOv8 | YOLOv5 |
|---|---|---|---|
| mAP@50 | % | 98.9 | 95.2 |
| mAP@50-95 | % | 78.8 | 64.0 |
| Precision | % | 0.939 | 0.860 |
| Recall | % | 0.895 | 0.892 |
| F1 score (accuracy) | % | 0.916 | 0.876 |
| Weight decay | – | 0.0005 | 0.0005 |
| Mini-batch size | – | 16 | 16 |
| Training time | hours | 0.37 | 0.533 |
| Inference speed per image | ms | 11.4 | 11.8 |
| Frames per second | fps | 87.719 | 84.745 |
| GPU memory usage | GB | 7.48 | 1.99 |
| Method | Precision (%) | Recall (%) | mAP (%) |
|---|---|---|---|
| Faster R-CNN (Yu and Shin) | 78.49 | 81.42 | 79.33 |
| RetinaNet (Yu and Shin) | 82.16 | 83.35 | 81.14 |
| YOLOv5 (Yu and Shin) | 83.02 | 84.97 | 82.26 |
| YOLOv5 (ours) | 86.0 | 89.2 | 95.2 |
| YOLOv8 (ours) | 93.7 | 89.5 | 98.9 |

Share and Cite

Folarin, A.; Munin-Doce, A.; Ferreno-Gonzalez, S.; Ciriano-Palacios, J.M.; Diaz-Casas, V. Real Time Vessel Detection Model Using Deep Learning Algorithms for Controlling a Barrier System. J. Mar. Sci. Eng. 2024, 12, 1363. https://doi.org/10.3390/jmse12081363


A comprehensive survey of deep learning-based lightweight object detection models for edge devices

  • Open access
  • Published: 10 August 2024
  • Volume 57, article number 242 (2024)


Payal Mittal

This study concentrates on deep learning-based lightweight object detection models for edge devices. Designing such lightweight object recognition models is more difficult than ever due to the growing demand for accurate, fast, and low-latency models on various edge devices. The most recent deep learning-based lightweight object detection methods are comprehensively described in this work, and information on the lightweight backbone architectures used by these object detectors is listed. The training and inference processes for deep learning applications on edge devices are discussed. To raise readers' awareness of this developing domain, a variety of applications for deep learning-based lightweight object detectors and related utilities is presented, and recommendations for designing potent, lightweight object detectors are offered as a counter to the associated challenges. On well-known datasets such as MS-COCO and PASCAL-VOC, we thoroughly examine the performance of several conventional deep learning-based lightweight object detectors.


1 Introduction

The advancement of effective deep learning-based object detectors has been influenced by Internet of Things (IoT)-based technologies. Although many object detectors attain outstanding accuracy and carry out inference in real time, the majority of deep object models demand too much Central Processing Unit (CPU) power and cannot be used on edge devices (Wang et al. 2021a, 2021b, 2021c, 2022). Exciting outcomes have already been achieved using a variety of strategies. Strategies for deploying deep learning-based applications on edge devices include (Wang et al. 2020a, 2020b, 2020c, 2021a, 2021b, 2021c; Véstias et al. 2020; Li and Ye 2023; Subedi et al. 2021):

Using a partitioning technique, since various layers may execute at different times: for example, dividing the processing graph of a fully connected or convolutional layer into offloadable tasks so that the execution time of each composite task unit is the same.

Large-scale analytics platforms require intermediate resource standardisation for data manageability and low latency, as opposed to standalone applications on mobile devices. With the provisioning of intermediate resources, a deep learning-based analytics platform can determine the proportion of local processing, provided that there is a mechanism to divide the load between buffering and memory loading. Offloaded execution through efficient partitioning can reduce costs, latency, or any other deployment concern.

Moreover, a detailed study is provided in Sect. 4.6 of the manuscript. In recent years, a new field of study, lightweight object detection, has emerged with the goal of developing compact, effective networks for IoT deployments that frequently take place in low-computing or resource-constrained settings. The research community has long worked to identify the most accurate detection models through advanced architectural searches, as developing a deep learning-based lightweight network architecture is a difficult procedure. When using these models on edge devices, such as high-performance embedded processors, the question arises of how to run high-end innovative applications with fewer resources. It is still not entirely possible to perform detection using a smartphone or edge device: although existing models are capable of doing this task, their precision is simply insufficient and undesirable in real-time settings.

Edge computing, according to Gartner, is a component of a distributed computing architecture where data processing resides near the edge, where devices or individuals generate or consume that data (Hua et al. 2023). Because of the constant growth in data created by the IoT, edge computing was initially adopted to reduce bandwidth costs for data travelling long distances; today, the emergence of real-time applications that require processing at the edge is driving further technological advancements. Among many other benefits, data minimization at the network level can prevent bottlenecks and significantly reduce energy, bandwidth, and storage expenses. While a single device can send data across a network easily, problems occur when hundreds of devices send data at once: besides reducing quality due to delay, this raises bandwidth expenses and creates bottlenecks that can result in cost spikes. By acting as a local source for these systems' data processing and storage, edge computing services and offerings help fix this problem. An edge gateway likewise minimizes bandwidth requirements by processing data from an edge device and sending only the pertinent data back through the cloud (Jin et al. 2021). Edge devices are a key element in modern integrated real-world Artificial Intelligence (AI) systems. Initially, IoT devices could only gather data and send it to the cloud for processing. By putting services closer to a network's edge, edge computing expands the possibilities of cloud computing and enables a wider range of AI services and machine learning applications. IoT computing devices, mobile devices, embedded computers, smart TVs, and other connected gadgets can all be termed edge devices. Real-time application development and deployment can be accelerated by edge computing devices through high-speed networking technologies such as 5G. Robotics, image and video processing, intelligent video analytics, self-driving cars, medical imaging, machine vision, and industrial inspection are examples of such applications (Véstias et al. 2020).

Edge computing can be applied to devices that are directly connected to sensors, to routers or gateways that transfer data, or to small servers installed locally in a closet. There is an increasing number of edge computing use cases, as well as smart devices capable of doing various activities at the edge, and the range of applications for edge computing is expanding in tandem with the development of AI capabilities (Xu et al. 2020). Additionally, there is a good deal of overlap among the various use cases for edge computing. In particular, edge computing functionality in traffic management systems is closely related to that of autonomous vehicles, as briefly discussed below:

Industrial infrastructure

Predictive maintenance and failure detection management in industries are supported by edge computing. Before a machine or component breaks down, the capability kicks in, enabling factory workers to fix the issue or replace the part in advance and save money by preventing lost output. The architecture of edge computing can handle large amounts of data from sensors and programmable logic controllers, as well as facilitate effective communications across extremely complicated supervisory control and data acquisition systems.

Retail

Huge amounts of data are produced by retail applications from different point-of-sale systems, item stocking procedures, and other company operations. Edge computing can assist in analysing this vast quantity of data and locating problems that require quick resolution. Additionally, edge computing provides a way to handle consumer data locally, preventing it from leaving the client's residence, a privacy regulation problem that is becoming more pressing.

Healthcare

In order to give medical practitioners precise, timely information about a patient's status, the healthcare and medical industries gather patient data from sensors, monitors, wearable technology, and other devices. Edge computing solutions can provide dashboards with such data so users can see all the key indicators in one convenient place. AI-enabled edge computing solutions can recognise anomalous data, allowing medical personnel to respond to patient requirements quickly and with minimal false alarms. Furthermore, edge computing devices can aid in addressing concerns related to patient confidentiality and data privacy by processing data locally.

Global energy

Cities and smart grid systems can monitor public buildings and facilities for improved energy efficiency in areas like lighting, heating, and clean energy use by using edge computing devices. As an illustration:

  • intelligent lighting controls utilise edge computing devices to regulate individual lights for optimal efficiency and public space safety;
  • embedded edge computing devices are used in solar fields to detect changes in the weather and modify their position;
  • wind farms use edge computing to send sensor data to substations and link to cell towers.

Public transit systems

Only the data necessary to support in-car activities and dispatcher insights in public transportation applications can be collected and transmitted by edge computing systems deployed in buses, passenger rail systems, and paratransit vehicles.

Travel transport utilities

In order to increase convenience and safety, edge computing can control when traffic signals turn on and off, open and close additional lanes of traffic, make sure that communications are maintained in the event of a public emergency, and do other real-time tasks. The adoption of autonomous vehicles will be significantly influenced by sophisticated traffic management systems, as was previously indicated.

Advanced industries

In advanced industries, vehicle sensors and cameras can provide data to edge computing devices, which make choices in milliseconds without any latency. This fast decision making is necessary in autonomous vehicles for safety reasons. Self-parking apps and lane-departure warning are two examples of edge computing services that are currently readily accessible. Furthermore, as more cars are able to communicate with their surroundings, a quick and responsive network will be required. Electric vehicles also require constant monitoring to support predictive maintenance, and edge computing can be used to manage data in this regard: data aggregation at the edge yields actionable data for maintenance and performance. A multitude of industries is investing in the applicability of edge devices, including travel, transport and logistics, cross-vertical, retail, public sector utilities, global energy and materials, banking and insurance, and infrastructure and agriculture. Their share representation with respect to employability in various edge computing devices is shown in Fig. 1a (Chabas et al. 2018): travel, transport and logistics holds the maximum share at 24.2%, followed by 13.1% for global energy markets, 10.1% for retail and advanced industries, and smaller shares for other industries. We also compare hardware costs, in terms of minimum and maximum cost, of edge computing devices for the mentioned industries. The hardware value includes the opportunity across the tech stack on the basis of sensors, on-device firmware, storage, and processor. By 2025, edge computing-based devices represent $175 to $215 billion of potential hardware value: travel, transport and logistics approximately $35 to $43 billion, cross-vertical an estimated $32 to $40 billion, $20 to $28 billion in the retail sector, $16 to $24 billion in public sector utilities, $9 to $17 billion in global energy and materials, and $4 to $11 billion in infrastructure and agriculture, as depicted in Fig. 1b (Chabas et al. 2018). There is a dire need to focus on advancing the development of lightweight object detection models to boost their employability in heterogeneous edge devices. This survey analyses state-of-the-art deep learning-based lightweight object detection models designed to attain excellent performance on edge devices. With equivalent accuracy, powerful lightweight object detection models offer these advantages (Kim et al. 2023; Huang et al. 2022):

Lightweight object detection models based on deep learning require less communication during distributed training at the edge.

Less bandwidth will be needed to export a cutting-edge detection model from the cloud to a particular application.

Deploying lightweight detectors on Field Programmable Gate Arrays (FPGAs) and other hardware with limited resources is more practical.

Figure 1. a Share representation of various industries embedded in edge computing devices. b Comparison of hardware costs in case of edge computing devices

1.1 Motivation

Object detection is the core concept in deploying innovative edge-device-based applications such as face detection (Li et al. 2015), object tracking (Nguyen et al. 2019), video surveillance (Yang and Rothkrantz 2011), pedestrian detection (Brunetti et al. 2018), etc. The powerful capabilities of deep learning boost the performance of object detection in these applications. Generic deep learning-based object detection models have computational complexities such as extensive use of platform resources, more bandwidth, and large data processing pipelines (Jiao et al. 2019; Zhao et al. 2019). A detection network can potentially use three orders of magnitude more Floating Point Operations (FLOPs) than a classification network, making its deployment on an edge device much more difficult (Ren et al. 2018). Generic deep object detectors often use more network layers, which in turn require extensive parameter tuning. Deeper models also find it harder to detect small targets, because position and feature information is lost across layers. Finally, overly large parameter counts can damage a model's effectiveness and make it challenging to implement on smart mobile terminals.

For the development of lightweight object detection on edge devices, a comprehensive assessment of the research directions related to this topic is necessary, particularly for researchers who are interested in pursuing this line of inquiry. Assessing the usefulness of deep learning-based lightweight object detection on edge devices requires more than a basic review of the literature, and the present study therefore offers a comprehensive examination of it. No recent dedicated evaluation of deep learning-based lightweight detection exists in the literature: there are generic and application-specific surveys dedicated to deep learning-based object detectors (Jiao et al. 2019; Zou et al. 2023; Liu et al. 2020a, 2020b, 2020c, 2020d; Mittal et al. 2020; Han et al. 2018; Zhou et al. 2021a, 2021b), but no consolidated study specifically covers lightweight detectors for edge devices, as mentioned in Table 1. To raise readers' understanding of this developing subject, deep learning-based lightweight object detectors on edge devices have been investigated in this work. The release of this study will advance research on deep learning-based lightweight object detection models with regard to various backbone architectures and diverse applications on edge devices. The key objectives of the survey are as follows:

To provide a taxonomy of deep learning-based lightweight object detection algorithms on edge devices

To provide an analysis of deep learning-based lightweight backbone architectures for edge devices

To present literature findings on applications deployed through lightweight object detectors

To compare lightweight object detectors by analyzing results on leading detection datasets

The organization of the paper is as follows: Sect. 2 elaborates the work related to the development of deep learning-based object detectors, which are further categorized into two-stage, one-stage, and advanced-stage models. Sect. 3 describes the materials and methods required for deep learning-based lightweight detection models on edge devices; the architectural details related to training and inference of lightweight models are also covered, along with crucial properties and performance milestones of lightweight object detection methods. Sect. 4 discusses commonly utilized backbone architectures in deep learning-based lightweight object detection models, their applications, and recommendations for designing a powerful deep learning-based lightweight model. The final section concludes the study and outlines some crucial directions for further research.

2 Background

Recent developments in the field of deep learning-based object detectors have mostly concentrated on raising the benchmark datasets’ state-of-the-art accuracy, which has caused an explosion in model size and parameters. The research, on the other hand, has demonstrated interest in suggesting lighter, smaller, and smarter networks that would minimise the parameters while keeping cutting-edge performance (Nguyen et al. 2023 ). In the next section, we will provide a brief summary regarding categorization of generic object detection models.

2.1 Taxonomy of deep learning-based object detectors

During the last years, there has been a rapid and successful expansion of the lightweight object detection research domain, which has grown by adopting the latest machine and deep learning methods and developing new representations. Generic deep learning-based object detection models are classified into two-stage, one-stage, and advanced-stage models, each having different concepts.

2.1.1 Two-stage object detection models

Two-stage algorithms have two different stages: region proposal and detection head. The first stage calculates RoI proposals using anchors in external region proposal techniques such as Edge Box (Zitnick and Dollár 2014) or Selective Search (Uijlings et al. 2013). The second stage processes the extracted RoIs into final bounding boxes, coordinate values, and class labels. Examples of two-stage algorithms include Faster RCNN (Ren et al. 2015), Cascade RCNN (Cai and Vasconcelos 2018), R-FCN (Dai et al. 2016), etc. The advantages of two-stage object detectors include better analysis of objects through the given stages, a multi-stage architecture that regresses bounding box values efficiently, and better handling of class imbalance in datasets. Two-stage detectors adopt a deep neural Region Proposal Network (RPN) and a detection head. Even though the existing Light-Head R-CNN (Li et al. 2017) used a lightweight detection head, the backbone and detection part become imbalanced when the detection part is combined with a small backbone; this mismatch increases the danger of overfitting and causes redundant calculation.

2.1.2 One-stage object detection models

Two-stage detectors helped deep learning-based object detection get off to a good start, but these systems struggled with speed. Due to their flexibility in satisfying demands like high speed and minimal memory needs, one-stage detectors were ultimately adopted by researchers. One-stage algorithms eliminate the region proposal stage of two-stage detectors, treating object detection as a regression problem. Instead of sending portions of the image to a fixed grid-based CNN, the entire image is processed at once, and anchors assist in identifying specific region proposals whose bounding box coordinates are regressed directly. Examples of one-stage detectors include YOLO (Redmon et al. 2016), SSD (Liu et al. 2016), RetinaNet (Lin et al. 2017a, 2017b), etc. The YOLO series outperforms two-stage models in terms of efficiency and accuracy.

2.1.3 Advanced-stage object detection models

The recently emerged advanced-stage object detectors removed the anchor concept used by one-stage detectors. The advanced detector CornerNet (Law and Deng 2018) detected objects as paired keypoints, introducing a new corner pooling layer to better localize corners. CenterNet (Duan et al. 2019) detected the object as a triplet, rather than a pair, of keypoints. Foveabox (Kong et al. 2020a, 2020b) predicted category-sensitive semantic maps and a category-agnostic bounding box for each object. Advanced-stage detectors still struggle to locate multiple small targets against complex backgrounds and can suffer from slow detection speed. One-stage methods (Bochkovskiy et al. 2020; Qin et al. 2019) have utilized both predefined anchor boxes and anchor-free (Duan et al. 2019) concepts for predicting bounding boxes.

2.1.4 Light-weight object detection models

Lightweight object detectors are those with low computational demands in terms of bandwidth and resource utilization; a few examples include ThunderNet (Qin et al. 2019), PP-YOLO (Long et al. 2020a, 2020b), YOLObile (Cai et al. 2021), Trident-YOLO (Wang et al. 2022a, 2022b, 2022c, 2022d), YOLOv4-tiny (Jiang et al. 2020), Trident FPN (Picron and Tuytelaars 2021), etc.

The categorization of deep learning-based object detection algorithms into two-stage, one-stage, advanced-stage, and lightweight detectors is highlighted in Fig. 2. Algorithms such as Faster RCNN (Ren et al. 2015), Mask RCNN (He et al. 2017), Cascade RCNN (Cai and Vasconcelos 2018), FPN (Lin et al. 2017a, 2017b), and R-FCN (Dai et al. 2016) fall under two-stage detectors, whereas YOLO (Redmon and Farhadi 2018), SSD (Liu et al. 2016), RefineDet (Zhang et al. 2018a, 2018b), and RetinaNet (Lin et al. 2017a, 2017b) are one-stage detectors. Advanced object detectors such as CornerNet (Law and Deng 2018), Objects as Points (Zhou et al. 2019a), and Foveabox (Kong et al. 2020a, 2020b) are also listed in Fig. 2. However, the algorithms listed above often include a large number of channels and convolutional layers, which demand substantial computing power and hinder deployment on edge devices. The deep learning-based lightweight object detectors presented in Fig. 2 are specifically designed for contexts with limited resources. Due to their efficiency and compactness, the one- and advanced-stage detector pipeline is the industry standard for designing lightweight object detectors.

Figure 2. Taxonomy of recent deep learning-based object detection algorithms

3 Deep learning-based lightweight object detection models for edge devices

Numerous computer vision tasks, such as autonomous driving, robot vision, intelligent transportation, industrial quality inspection, object tracking, etc., have used deep learning-based object detection to a large extent. Deep models typically improve performance, but the deployment of real-world applications onto edge devices is constrained by their resource-intensive network. Lightweight mobile object detectors have drawn growing research interest as a solution to this issue, with the goal of creating extremely effective object detection. Deep learning-based lightweight object detectors have recently been developed for situations with limited computer resources, such as mobile devices.

The necessity to execute backbone designs on edge devices with constrained memory and processing power stimulates research and development of deep learning-based lightweight object detection models. A number of efficient lightweight backbone architectures have been proposed in recent years, for example, MobileNet (Howard et al. 2017), ShuffleNet (Zhang et al. 2018a, 2018b), and DetNaS (Chen et al. 2019). However, all these architectures depend heavily on widely deployed depth-wise separable convolution-based methodologies (Ding et al. 2019). In the following sections, we describe the methodology and each component of deep learning-based lightweight object detection models in depth; these models are heavily influenced by existing simple and complex object detection models. We give an architectural breakdown of deep learning-based lightweight object detection models in the following section.

3.1 Architecture methodology of lightweight object detection models

The building blocks of deep learning-based lightweight object detection algorithms on edge devices consist of several components: input, backbone, neck, and detector head. The definition and details of each component are tabulated in Table 2. The input for a lightweight object detector is an image, patch, or pyramid, initially fed into a lightweight backbone architecture such as CSPDarkNet (Redmon and Farhadi 2018), ShuffleNet (Zhang et al. 2018a, 2018b), MobileNet (Qian et al. 2021), or PeleeNet (Wang et al. 2018) for the calculation of feature maps. The backbone is the part of the architecture that converts an image to feature maps, whereas the neck transforms the feature maps by connecting the backbone to the detector head. The backbone may be a pre-trained network or a neural network built from scratch with the aim of feature extraction; it performs feature extraction and produces feature maps as output. The neck component then transforms these feature maps into the feature vectors required to handle various application-specific detection challenges. The lightweight detector head can be visualized as a deep neural network focusing on the extraction of RoIs; a pooling layer fixes the size of the calculated RoIs to compute final features of the detected objects. The final features are then passed to classification and regression loss functions to assign class labels and regress the coordinate values of bounding boxes. This whole process is repeated until the final regressed bounding box values are obtained with the required class labels. As presented in Fig. 3, a deep learning-based lightweight object detector consists of three parts: backbone architecture, neck components, and lightweight head prediction. Input images are fed to the backbone, whose architecture converts each image into feature maps; for deep learning-based lightweight models, the backbone should be chosen from the categories given in Table 2. A fundamental convolutional module of Conv2D + batch normalization + ReLU activation makes up the backbone architecture. By eliminating redundant gradient information from the CNN's optimization process and integrating gradient changes into the feature map, it lowers input parameters and model size (Wang et al. 2020a, 2020b, 2020c). In the bottleneck cross stage partial darknet model, for instance, a 640 × 640-pixel image is divided into four 320 × 320-pixel images, which are then combined to form a 320 × 320-pixel feature map; 32 convolutional kernels produce the resulting 320 × 320 × 32 feature map. An SPP module is also included to aggregate features of various sizes and increase the network's receptive field. By enhancing the information flow between the backbone architecture and the detecting head, the neck refines the feature maps. One such neck, PANet, is built on an FPN topology to provide strong semantic features from top to bottom (Wang et al. 2019), while FPN layers from bottom to top express important positional features.

Figure 3. Methodology of deep learning-based lightweight object detection model

Furthermore, PANet encourages the transmission of low-level characteristics and the use of precise localization signals in the bottom layers, which improves the position accuracy of the target object. The prediction layer, sometimes referred to as the detection layer, creates multiple feature maps in order to accomplish multiscale prediction, allowing the model to classify and detect objects of various sizes. Each feature map yields regression bounding boxes at each position, and the predicted output with bounding boxes is then shown as a detection result. The three parts mentioned above together constitute the training model of the lightweight object detector. After model training, the test data is passed through to obtain a fine-tuned lightweight model with modified features, as shown in Fig. 3. The parameters relevant to deep learning-based lightweight models are discussed below, after a brief structural sketch of the pipeline just described.
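The sketch below illustrates the input-backbone-neck-head structure described above as a minimal PyTorch module; the layer sizes and the single-level head are illustrative assumptions, not any specific published detector.

```python
# Hedged sketch: a toy backbone-neck-head detector built from the
# depth-wise separable Conv2D + BN + ReLU blocks described in the text.
import torch
import torch.nn as nn

def dw_separable(c_in, c_out, stride=1):
    """Depthwise separable convolution block (Conv2D + BN + ReLU)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),  # depthwise
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TinyDetector(nn.Module):
    def __init__(self, num_classes=20, num_anchors=3):
        super().__init__()
        # Backbone: image -> feature maps
        self.backbone = nn.Sequential(
            dw_separable(3, 32, stride=2),
            dw_separable(32, 64, stride=2),
            dw_separable(64, 128, stride=2),
        )
        # Neck: transforms features for the head (a single fusion block here;
        # real lightweight detectors would use an FPN/PANet-style neck).
        self.neck = dw_separable(128, 128)
        # Head: per-cell class scores plus 4 box offsets for each anchor.
        self.head = nn.Conv2d(128, num_anchors * (num_classes + 4), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyDetector()(torch.randn(1, 3, 320, 320))
print(out.shape)  # torch.Size([1, 72, 40, 40])
```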

To train an edge-cloud-based deep learning model, edge devices and cloud servers must share model parameters and other data, and the larger the training model, the more data must be transferred between them. A number of methods have been put forth to lower the communication cost during training, including Edge Stochastic Gradient Descent (eSGD), which can reduce a CNN model's gradient size by up to 90% by communicating only the most important gradients, and intermediate edge aggregation prior to federated-learning server aggregation; a sketch of the top-k gradient selection idea follows this paragraph. Two further components of training deep learning-based lightweight detection models are the ability to exit before the input data completes a full forward pass through each layer of a neural network distributed over heterogeneous nodes, and the use of binarized neural networks to reduce the memory and compute load on resource-constrained end devices (Koubaa et al. 2021; Dey and Mukherjee 2018).
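A minimal sketch of the top-k gradient selection idea attributed to eSGD above is given here; it illustrates gradient sparsification only and is not the published algorithm in full.

```python
# Hedged sketch: communicate only the most important gradient entries
# (top-k by magnitude) between an edge device and the server.
import torch

def topk_sparsify(grad, keep_ratio=0.1):
    """Keep the top `keep_ratio` fraction of gradient entries by magnitude."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * keep_ratio))
    idx = flat.abs().topk(k).indices    # indices of the important coordinates
    return idx, flat[idx]               # ~90% smaller payload at keep_ratio=0.1

def densify(idx, values, shape):
    """Server side: rebuild a dense gradient from the sparse update."""
    flat = torch.zeros(shape).flatten()
    flat[idx] = values
    return flat.view(shape)

g = torch.randn(64, 128)
idx, vals = topk_sparsify(g, keep_ratio=0.1)
print(densify(idx, vals, g.shape).count_nonzero())  # ~819 of 8192 entries kept
```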

Researchers have created a novel architecture known as Agile Condor that carries out real-time computer vision tasks using machine learning methods; at the network edge, close to the data sources, Agile Condor can be utilised for autonomous target detection (Isereau et al. 2017). Precog is a new method that lowers latency for mobile applications through prefetching and caching: it anticipates the subsequent classification request and uses end-device caching to store essential portions of a trained classifier. As a result, fewer offloads to the cloud occur, and edge servers calculate the likelihood that linked end devices will make a request in the future. These pre-fetched modules function as smaller models that minimise network traffic and cloud processing while accelerating inference on the end devices (Drolia et al. 2017). Another example, ECHO, is a feature-rich, thoroughly tested framework for implementing data analytics in a distributed hybrid Edge-Fog-Cloud configuration. ECHO offers services such as virtualized application status monitoring, resource discovery, deployment, and interfaces to data analytics runtime engines (Ogden and Guo 2019).

When feasible, distributed deep network designs enable deployment on edge-cloud infrastructure to support local inference on edge devices. A distributed neural network model's ability to function effectively depends on minimising inter-device communication costs. Inference on the end-edge-cloud architecture is a dynamic problem because of evolving network conditions (Subedi et al. 2021), and static methods such as remote-only or on-device-only inference are not optimal. Ogden and Guo have created a distributed architecture that provides a flexible answer to this problem for mobile deep inference. A centralised model manager houses many deep learning models, and the inference environment (memory, bandwidth, and power) is used to dynamically determine which model should run on which device. If resources are scarce in the inference environment, one of the compressed models may be employed; otherwise, an uncompressed model with higher accuracy is used. Edge servers handle remote inference when networks are sluggish.

Privacy and security

Edge devices can be used to filter personally identifiable information prior to data transfer in order to enhance user privacy and security when processing data remotely (Xu et al. 2020; Hu et al. 2023a, 2023b). Since data generated by end devices is not sent to a central location, training deep learning models across several edge devices in a distributed way improves privacy. Personally identifiable information in photographs and videos can be removed at the edge before being uploaded to an external server, enhancing user privacy. The privacy of critical training data becomes an issue when training is conducted remotely: to ensure local and global privacy, it is imperative to watch for any decline in accuracy, keep computing overheads low, and provide resilience to communication errors and delays (Abou et al. 2023; Makkar et al. 2021).

3.2 Comprehensive analysis of lightweight object detection models

Small, portable object detectors capable of highly effective detection have garnered increasing scientific attention. The effectiveness of deep learning-based lightweight object detection models has grown with the use of efficient components and compression techniques such as pruning, quantization, and hashing. Distillation, in which a large network is used to train smaller models, has produced some surprising results as well. A comprehensive list containing details of recent deep learning-based lightweight object detection models is presented in Tables 3 and 4, where lightweight detectors are categorized as anchor-based or anchor-free. Anchor-based methods are the RoI-extraction mechanism employed in object detection models such as Fast R-CNN (Girshick 2015): anchor boxes of various scales, which can be viewed as RoI priors, are used for regressing bounding box coordinate values; a minimal sketch of IoU-based anchor assignment follows this paragraph. Detectors including YOLOv2 (Redmon and Farhadi 2017), YOLOv3 (Redmon and Farhadi 2018), YOLOv4 (Bochkovskiy et al. 2020), RetinaNet (Lin et al. 2017a, 2017b), RefineDet (Zhang et al. 2018a, 2018b), EfficientDet (Tan et al. 2020), Faster R-CNN (Ren et al. 2015), Cascade R-CNN (Cai and Vasconcelos 2018), and TridentNet (Li et al. 2019), belonging to the one- and two-stage families, use the anchor mechanism to elevate detection performance. Besides, anchor-free detectors have recently received more attention in academia and research, with a large number of new anchor-free methods proposed. Earlier works such as YOLOv1 (Redmon et al. 2016), DenseBox (Huang et al. 2015), and UnitBox (Yu et al. 2016) can be considered early anchor-free detectors. Anchor-free methods use either anchor points or keypoints to perform detection: the former approach regresses object bounding boxes from anchor points instead of anchor boxes, as in FCOS (Detector 2022) and FoveaBox (Kong et al. 2020a, 2020b), whereas the latter reformulates object detection as a keypoint localization problem, as in CornerNet (Law and Deng 2018; Law et al. 2019), CenterNet (Duan et al. 2019), ExtremeNet (Zhou et al. 2019b), and RepPoints (Yang et al. 2019). By eliminating the restrictions of handcrafted anchors, anchor-free techniques show great promise for handling extremely large and small objects. The anchor-based detectors shown in Table 3 can compete with some newly proposed anchor-free lightweight object detectors in terms of performance. Input image type, code links, and published sources are also mentioned in Table 3, while Table 4 reports crucial milestones such as AP, description, and loss function for each deep learning-based lightweight detector.
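To make the anchor mechanism concrete, the sketch below matches anchors of several scales to a ground-truth box by IoU; the thresholds are illustrative assumptions, not those of any particular detector.

```python
# Hedged sketch: IoU computation and a simple positive/negative anchor
# assignment, as used (with many refinements) by anchor-based detectors.
import torch

def box_iou(a, b):
    """IoU between box sets `a` (N,4) and `b` (M,4) in (x1, y1, x2, y2) format."""
    tl = torch.max(a[:, None, :2], b[None, :, :2])   # overlap top-left
    br = torch.min(a[:, None, 2:], b[None, :, 2:])   # overlap bottom-right
    inter = (br - tl).clamp(min=0).prod(-1)
    area_a = (a[:, 2:] - a[:, :2]).prod(-1)
    area_b = (b[:, 2:] - b[:, :2]).prod(-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

anchors = torch.tensor([[0., 0., 32., 32.],
                        [0., 0., 64., 64.],
                        [16., 16., 80., 80.]])       # anchors of various scales
gt = torch.tensor([[10., 10., 70., 70.]])            # one ground-truth box

iou = box_iou(anchors, gt).squeeze(1)                # approx. [0.117, 0.610, 0.610]
positive = iou > 0.5                                 # anchors matched to the object
negative = iou < 0.4                                 # background anchors
print(iou, positive, negative)
```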

Tiny-DSOD (Li et al. 2018), a lightweight object detector inspired by the deeply supervised object detection framework, has been proposed for resource-constrained applications. With only 0.95 M parameters and 1.06B FLOPs, it uses depth-wise dense blocks as the backbone and a depth-wise FPN in the neck, which is by far the most advanced result with such a small resource demand. ThunderNet (Qin et al. 2019), a lightweight two-stage detector, uses a context enhancement module and a spatial attention module as backbone architectural blocks to produce more discriminative feature representations, together with an efficient RPN in a compact detection head. ThunderNet outperforms earlier lightweight one-stage detectors, operating at 24.1 frames per second with 19.2 AP on COCO on an ARM-based smartphone. One of the most recent cutting-edge lightweight object detection algorithms, PP-YOLO (Long et al. 2020a, 2020b), employs MobileNetV3 (Qian et al. 2021), a practical backbone architecture for edge devices. The depth-wise separable convolutions used by PPYOLOtiny's detection head make it better suited to mobile devices; PPYOLOtiny adopts the optimization techniques of the PP-YOLO algorithms but drops techniques that have a large impact on model size and performance. Block-punched pruning and a mobile acceleration unit with a mobile GPU-CPU collaboration scheme are provided by YOLObile (Cai et al. 2021). Trident-YOLO (Wang et al. 2022a, 2022b, 2022c, 2022d) is an upgrade of YOLOv4-tiny (Jiang et al. 2020) designed for mobile devices with limited computing power. In the neck, Trident FPN (Picron and Tuytelaars 2021) improves the recall and accuracy of basic object recognition methods by reorganizing the network topology. Trident-YOLO proposes fewer cross-stage partial RFBs and smaller cross-stage partial SPPs, enlarging the receptive field of the network with the fewest FLOPs; in effect, Trident-FPN significantly enhances lightweight object detection performance at the cost of a small increase in FLOPs while producing multi-scale feature maps. To simplify computation, YOLOv4-tiny (Jiang et al. 2020) uses two ResBlock-D modules in place of two CSPBlock modules of the ResNet-D network. To extract more feature information about the object, such as global features, channel, and spatial attention, it also creates an auxiliary residual network block with consecutive 3 × 3 convolutions, used to obtain 5 × 5 receptive fields and reduce detection error. Optimizing the original YOLOv4 (Bochkovskiy et al. 2020), Slim YOLOv4 (Ding et al. 2022) changes the backbone architecture from CSPDarknet53 to MobileNetv2 (Sandler et al. 2018); separable convolution and depth-wise over-parameterized convolutional layers were chosen to minimize computation and enhance the performance of the detection network. Based on YOLOv2 (Redmon and Farhadi 2017; Wang et al. 2022a), YOLO-LITE (Huang et al. 2018; Wang et al. 2021a) offers a faster, more efficient lightweight variant for mobile devices. With only 7 layers and 482 million FLOPs, YOLO-LITE runs at roughly 21 frames per second on a PC without a GPU and 10 frames per second when used on a website. The Fully Convolutional One-Stage (FCOS) detector (Detector 2022) addresses the issue of label overlap within the ground-truth data.
Unlike previous anchor-free detectors, it requires no complex hyper-parameter adjustment. Large-scale server detectors constitute the majority of anchor-free detectors in general; NanoDet (Li et al. 2020a, 2020b) and YOLOX-Nano (Ge et al. 2021) are two of the few anchor-free detectors for mobile devices. The issue is that compact anchor-free detectors typically struggle to strike a good balance between efficiency and accuracy. NanoDet, an FCOS-style method, employs Adaptive Training Sample Selection (ATSS) (Zhang et al. 2020a, 2020b, 2020c) to choose positive and negative samples and uses generalized focal loss as the loss function for classification and bounding-box regression. This loss function eliminates the centerness branch of FCOS and the numerous convolutions on that branch, lowering the computational cost of the detection head. A lightweight detector dubbed L-DETR (Li et al. 2022a), built on DETR and PP-LCNet, balances efficiency and accuracy. With the new backbone, L-DETR has fewer parameters than DETR; the backbone computes the global features used to arrive at the final prediction, and its normalization and FFN are enhanced, raising the precision of frame detection. In Table 5, some well-known metrics for measuring the performance of lightweight object detection models are highlighted. FLOPs are frequently used to determine how computationally complex deep learning models are: they provide a quick and simple way of estimating how many arithmetic operations are needed to complete a particular computation, and they offer helpful insight into computational cost and energy consumption, which is particularly relevant for edge computing. As highlighted, YOLOv7-x has the highest FLOPs, 189.9G, among the mentioned detectors. One of the more important considerations when deploying a deep network architecture is network latency, or inference time: most real-world applications need inference times that are fast, from a few milliseconds to a second, and measuring a neural network's inference time accurately requires in-depth knowledge. The inference time is the time a deep learning algorithm takes to process fresh input and produce a prediction; the number of layers, the complexity of the network, and the number of neurons in each layer can all impact it, and inference times typically rise with network complexity and scale. In our analysis, YOLOv3-Tiny has the lowest inference time, at 4.5 ms. Frames Per Second (FPS) measures how rapidly a deep learning model can handle frames, i.e., how quickly an object detection model processes photos and videos and produces the desired results; YOLOv4-Tiny has the highest FPS among those presented in Table 5. Weights and biases are the model parameters in deep learning, characteristics learned from the training data during the learning process, and the total number of parameters, the sum of all weights and biases in the neural network, is a common indicator of a model's complexity. YOLOX-Nano has the fewest learnable parameters compared with the others.
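In practice, the latency, FPS, and parameter-count figures of Table 5 are obtained by direct measurement. A minimal PyTorch sketch (the backbone choice, input size, and run counts are arbitrary) that reports all three for a model on CPU:

```python
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None).eval()
x = torch.randn(1, 3, 224, 224)  # one RGB frame

n_params = sum(p.numel() for p in model.parameters())  # all weights and biases

with torch.no_grad():
    for _ in range(10):           # warm-up runs, excluded from timing
        model(x)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    elapsed = time.perf_counter() - start

print(f"params: {n_params / 1e6:.2f} M | "
      f"latency: {elapsed / runs * 1e3:.1f} ms | FPS: {runs / elapsed:.1f}")
```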
Moreover, for each lightweight object detector, a prediction about its suitability for deployment in real-time applications has been made on the basis of the AP values highlighted in Table 4. MobileNet-SSD, MobileNetV2-SSDLite, Tiny-DSOD, Pelee, YOLO-Lite, MnasNet-A1 + SSDLite, YOLOv3-Tiny, NanoDet, and Mini YOLO are not efficient enough for such deployment.

Additionally, in recent years, one-stage YOLO-based lightweight object detectors have been developed, as listed in Table 6. In 2024, DSP-YOLO (Zhang et al. 2024) and YOLO-NL (Zhou 2024) emerged but are not yet ready to be deployed in real-life applications. On the contrary, EL-YOLO (Hu et al. 2023a, 2023b), YOLO-S (Betti and Tucci 2023), GCL-YOLO (Cao et al. 2023), Light YOLO (Yin et al. 2023), Edge YOLO (Li and Ye 2023), GL-YOLO-Lite (Dai and Liu 2023), and LC-YOLO (Cui et al. 2023) can be integrated into real-life computing applications. Further, we have added performance parameters in terms of FLOPs, inference time, FPS, and number of parameters for each of these latest YOLO-based lightweight object detectors. As depicted in Table 6, YOLO-S uses the fewest FLOPs (34.59B), Light YOLO has the highest FPS (102.04), and GCL-YOLO has the fewest parameters.

3.3 Backbone architecture for deep learning-based lightweight object detection models

Deep learning-based models for image processing have advanced to the point of decisively outperforming more conventional methods in object classification (Krizhevsky et al. 2012). The most effective deep learning architectures for object classification have been Convolutional Neural Networks (CNNs), which function similarly to the human brain, with neurons that react to their surroundings (Makantasis et al. 2015; Fernández-Delgado et al. 2014). Well-known deep CNN architectures have been used as feature extractors for object classification, with the classifiers fine-tuned on top; training proceeds by forward propagation, with the filters and parameters initialized from random seeds. However, owing to severely resource-constrained conditions, notably in memory bandwidth, the development of specialized CNN architectures for lightweight object detection models has received less attention than expected. In this section, we summarize the backbones, i.e., the feature extractors, of deep learning-based lightweight object detection models. Backbone architectures extract the features for lightweight object detection tasks: an image is provided as input and a feature map is produced as output. Most backbone architectures for detection tasks are essentially networks for classification problems, minus the final fully connected layers. The DetNAS convolutional neural network is shown block by block in Fig. 4 to help understand how backbone architectures function in the context of lightweight object detection models. The blue and green blocks are ShuffleNetv2 blocks with 5 × 5 and 7 × 7 kernels, the pink blocks are 3 × 3 ShuffleNetv2 blocks, and the peach blocks are Xception ShuffleNetv2 blocks (Ma et al. 2018). Each stage has eight blocks, and the total number of blocks is forty. In the lightweight DetNAS architecture, large-kernel blocks are found in the low-level layers while deep blocks are found in the high-level layers: blocks with large kernels (5 × 5, 7 × 7) occupy the low-level layers of stages 1 and 2, whereas stages 3 and 4 comprise peach and pink blocks, as shown in the centre of Fig. 4. Six of these eight blocks, the Xception ShuffleNetv2 blocks, are deeper than standard 3 × 3 ShuffleNetv2 blocks. These observations lead to the conclusion that lightweight object detection networks differ visually from conventional detection and classification networks. In the following subsections, a brief introduction to the deep learning-based lightweight backbone architectures is given:

figure 4

Architectural details of backbone architecture DetNaS (Chen et al. 2019 )

3.3.1 MobileNet (Howard et al. 2017 )

MobileNet created an efficient network architecture made up of 28 depth-wise separable convolutions, factorizing a standard convolution into a depth-wise convolution and a 1 × 1 point-wise convolution. The depth-wise convolution applies a separate kernel to each input channel, isolating the filtering, and the point-wise convolution then merges the features, reducing both computing cost and model size. Two further model-shrinking hyperparameters, the width multiplier and the resolution multiplier, were added to improve performance and shrink the model further. The model's oversimplification and linearity, which resulted in fewer channels for gradient flow, were corrected in later versions.
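A minimal PyTorch sketch of this factorization (our illustration of the idea, not MobileNet's exact layer configuration) shows how the per-channel 3 × 3 filtering and the 1 × 1 channel-mixing step are separated:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # depth-wise: one 3x3 filter per input channel (groups == in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # point-wise: 1x1 convolution merges the filtered channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```

For a 3 × 3 kernel this replaces roughly in_ch × out_ch × 9 multiplications per position with in_ch × 9 + in_ch × out_ch, which is where the cost saving comes from.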

3.3.2 MobileNetV2 (Sandler et al. 2018 )

A novel module, the inverted residual with linear bottleneck, was added in MobileNetv2 to speed up calculations and improve accuracy. MobileNetv2 consisted of two convolutional layers together with 19 bottleneck modules. The computationally efficient MobileNetv2 feature extractor was adopted by the SSD authors for object detection; the resulting detector, known as SSDLite, boasted an 8× reduction in parameters with respect to the original SSD. It is simple to construct, generalizes well to different datasets, and consequently garnered positive feedback from the community.
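A sketch of the inverted residual idea, expand with a 1 × 1 convolution, filter depth-wise, then project back through a linear (activation-free) bottleneck, could look as follows; the expansion factor of 6 is the paper's default, while the rest of the configuration is illustrative:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),   # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),      # depth-wise filtering
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # linear bottleneck
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return x + self.block(x) if self.use_res else self.block(x)
```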

3.3.3 MobileNetv3 (Qian et al. 2021 )

In MobileNetv3, the unneeded portions of the network were iteratively removed during an automated platform-aware search in a factorized hierarchical search space. The resulting design proposal is then modified to improve the desired metrics. Since the architecture's filters regularly mirror one another, accuracy can be maintained even if half of them are discarded, which reduces the need for further processing. MobileNetv3 mixed the hard-swish and ReLU activation functions, hard-swish being a computationally cheaper approximation of swish that preserves accuracy.

3.3.4 ShuffleNet (Zhang et al. 2018a , 2018b )

According to the authors, many efficient networks lose their effectiveness as they scale down because of expensive 1 × 1 convolutions. ShuffleNet is a highly computationally efficient neural network design created especially for mobile devices. To overcome the issue of restricted information flow, it proposed pointwise group convolution together with channel shuffling. The ShuffleNet unit, like the ResNet block, substitutes a pointwise group convolution for the 1 × 1 layer and a depth-wise convolution for the 3 × 3 layer.
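The channel-shuffle operation itself is only a reshape, transpose, and reshape; a minimal sketch:

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels so information flows across convolution groups."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap the group and channel axes
    return x.view(n, c, h, w)

out = channel_shuffle(torch.randn(1, 8, 4, 4), groups=2)
```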

3.3.5 ShuffleNetv2 (Ma et al. 2018 )

ShuffleNetv2 advocated using speed or latency as a direct measure of computational complexity rather than FLOPs or other indirect metrics. Four guiding principles served as its foundation: equal channel widths to lower memory access cost, group convolution chosen according to the target platform, care with multi-path structures, which boost accuracy but fragment parallelism, and attention to the cost of element-wise operations. In this model, a channel-split layer divides the input in half; one half passes through three convolutional layers, is concatenated with the residual link, and is then sent through a channel-shuffle layer. ShuffleNetv2 outperformed other cutting-edge models of comparable complexity.

3.3.6 PeleeNet (Wang et al. 2018 )

PeleeNet is an inventive and effective architecture based on conventional convolution, created using a number of computation-saving strategies. PeleeNet's design comprises four stages of modified dense and transition layers, followed by the classification layer. The two-way dense layer helps obtain different receptive-field scales, which makes it simpler to identify larger objects, and a stem block minimizes the loss of information. While PeleeNet's performance did not match that of modern object detectors on mobile and edge devices, it demonstrated how even seemingly small design decisions can have a substantial impact on overall performance.

3.3.7 mNASNet (Tan et al. 2019 )

mNASNet was created using NAS (Neural Architecture Search) automation, conceptualizing the search as a multi-objective optimization problem with a dual focus on latency and accuracy. Unlike previous models that stacked identical blocks, this allowed individual blocks to be designed. It factorized the search space by dividing the CNN into distinct blocks and then searching for the operations and connections of each block separately. mNASNet was roughly twice as fast as MobileNetv2 and more accurate.

3.3.8 Once for all (OFA) (Cai et al. 2019 )

In recent years, modern models have been constructed using NAS for architecture design; nonetheless, training every sampled model results in costly computation. The OFA model needs to be trained only once, after which sub-networks can be extracted from it according to the requirements. Thanks to the OFA network, such sub-networks can vary in the four key dimensions of a convolutional neural network: depth, width, kernel size, and dimension. To keep the training process tractable, the full network is trained first and progressively smaller sub-networks nested within the OFA network are then fine-tuned, a scheme known as gradual shrinking.

3.3.9 MobileViT (Mehta and Rastegari 2021 )

Combining the benefits of CNNs and Vision Transformers (ViT), MobileViT is a transformer-based detector that is lightweight, portable, and compatible with edge devices. It successfully captures both short- and long-range dependencies by utilizing a novel MobileViT block, alongside which MobileNetv2 modules (Sandler et al. 2018) are arranged in series. Unlike previous transformer-based networks, it uses transformers as convolutions, which automatically incorporates spatial bias, so positional encoding is not necessary. MobileViT performed well on complex problems, supporting its claim to be a general-purpose backbone for various vision applications, and despite the constraints transformers face on mobile devices, it attained better accuracy with a smaller parameter budget.

3.3.10 SqueezeNet (Iandola et al. 2016 )

SqueezeNet attempts to maintain network accuracy while using far fewer parameters. Its design strategies were replacing most 3 × 3 filters with smaller 1 × 1 filters, decreasing the number of input channels to the remaining 3 × 3 filters, and placing the down-sampling layers late in the network. SqueezeNet's core module, the Fire module, consists of a squeeze layer and an expand layer, each followed by a ReLU activation. Eight Fire modules are stacked between the convolution layers to form the SqueezeNet architecture. SqueezeNet with residual connections, inspired by ResNet (He et al. 2016), was also developed and increased accuracy over the base model. SqueezeNet stands out as a serious contender for boosting the hardware efficiency of neural network topologies.
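Following this description, a Fire module can be sketched directly: a 1 × 1 squeeze layer feeds two parallel expand branches (1 × 1 and 3 × 3) whose outputs are concatenated; the channel counts here are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, 1)        # few 1x1 filters
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, 1)  # expand branch 1
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, 3,
                                   padding=1)                 # expand branch 2
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```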

The year of introduction and first usage of each backbone architecture, the number of parameters, merits, and top-1 accuracy are elaborated in Table 7. According to research on deep learning-based backbone architectures, SqueezeNet (Iandola et al. 2016) and ShuffleNetv2 (Ma et al. 2018) are the most widely used lightweight backbone architectures in edge devices today. Across the MobileNet series (Howard et al. 2017; Qian et al. 2021; Sandler et al. 2018), the performance of models built from depth-wise separable convolutions, inverted residual structures with linear bottlenecks, and automated complementary architecture search is progressively enhanced.

4 Performance analysis of deep learning-based lightweight object detectors

In this section, a comprehensive analysis is made of the above-discussed lightweight object detectors and related backbone architectures. Deep learning-based lightweight object detectors strike a balance between accuracy and efficiency: although the lightweight detectors from the previous sections have fast inference rates, their accuracy is not always sufficient for some tasks. As shown in Fig. 5, which evaluates deep learning-based lightweight object detectors in terms of mAP on the MS-COCO dataset, YOLOv7-x performs best among the detectors mentioned. The backbone architectures in deep learning-based lightweight object detectors play a vital role in determining the accuracy of the models, and convolutional architectures specifically designed for edge devices with limited bandwidth are the ideal choice for embedding in detection models. The top-1 accuracy comparison of deep learning-based lightweight backbone architectures in detection models is presented in Fig. 6. The backbone architecture ShuffleNetV2 attains 70.9 top-1 accuracy, a large jump from the SqueezeNet (Iandola et al. 2016) results. A marginal accuracy increase can be seen in architectures such as PeleeNet (Wang et al. 2018), DetNas (Chen et al. 2019), mNASNet (Tan et al. 2019), and GhostNet (Han et al. 2020a, 2020b), but the recently emerged transformer-based architecture MobileViT (Mehta and Rastegari 2021) achieves the best state-of-the-art results. Moreover, Fig. 7 summarizes the literature from 2017 to 2023 in terms of the number of publications on deep learning-based lightweight backbone architectures. As shown in Fig. 7, SqueezeNet has been the most widely utilized architecture in lightweight detectors over the years, while GhostNet (Paoletti et al. 2021) and MobileViT (Mehta and Rastegari 2021) account for more of the literature in 2022 and 2023. As mentioned above, state-of-the-art object detection works are either accuracy-oriented, using a large model size (Ren et al. 2015; Liu et al. 2016; Bochkovskiy et al. 2020), or speed-oriented, using a lightweight model but sacrificing accuracy (Wang et al. 2018; Sandler et al. 2018; Li et al. 2018; Liu and Huang 2018). It is difficult for any existing lightweight detector to meet the accuracy and latency requirements of real-world applications on mobile and edge devices at the same time. Therefore, we require a mobile device solution that accomplishes both high accuracy and low latency for deploying lightweight object detection models.

figure 5

mAP Performance evaluation of major deep learning-based lightweight object detectors

figure 6

Accuracy comparison of deep learning-based lightweight backbone architectures in detection models

figure 7

Year-wise literature summary of backbone architectures in case of lightweight detection models

4.1 Benchmark detection databases for light-weight object detection models

In this section, the most popular datasets used with deep learning-based lightweight object detectors are discussed. Datasets are essential for lightweight object detection because they allow standardized comparison of competing algorithms and the establishment of objectives for solutions.

4.1.1 PASCAL VOC (Everingham et al. 2010 )

This is the most well-known object detection dataset. The PASCAL-VOC versions VOC2007 and VOC2012 are frequently used in papers. VOC2007 comprises 2501 training, 2510 validation, and 5011 testing images, while VOC2012 comprises 10,991 training, 5823 validation, and 5717 testing images. The PASCAL VOC datasets include 11,000 images spread across 20 visual object classes, which can be divided into four broad categories: animals, vehicles, people, and domestic objects. Additionally, classes of objects with semantic similarity, such as trucks and buses, raise the difficulty of detection. The dataset is available at http://host.robots.ox.ac.uk/pascal/VOC/ .

4.1.2 MS-COCO (Lin et al. 2014 )

MS-COCO (Microsoft Common Objects in Context) is a sizable image dataset containing 328,000 photographs of everyday objects and people, and it is now one of the most popular and challenging object detection datasets. It has 897,000 tagged objects in 164,000 images across 80 categories, with 118,287 training, 5000 validation, and 40,670 testing images. The distribution of objects in MS-COCO is more in line with real-world circumstances. No annotation information is available for the MS-COCO testing set. MS-COCO offers annotations for captioning, keypoints, panoptic segmentation, dense pose, and object detection. The dataset provides a wide range of realistic images showing cluttered scenes with various backgrounds, overlapping objects, and so on. The URL of the dataset is http://cocodataset.org .

4.1.3 KITTI (Geiger et al. 2013 )

KITTI is a well-known dataset for traffic scene analysis and includes 7481 labelled images for training and 7518 for testing. There are 100,000 pedestrian instances, 6000 IDs, and an average of one person per image. The human class in KITTI has two subclasses: pedestrians and cyclists. The object labels are divided into easy, moderate, and hard levels based on how much the objects are occluded and truncated. Models trained on KITTI are assessed using three criteria that differ in the minimum bounding-box height and maximum occlusion level. The dataset can be downloaded at http://www.cvlibs.net/datasets/kitti/index.php .

We present the performance of deep learning-based lightweight detection models on the above-discussed detection datasets in Fig. 8. The lightweight object detector YOLOv4-dense achieves an mAP of 84.30 on the KITTI dataset and 71.60 on PASCAL VOC. The L4Net detector attains an mAP of 71.68 on KITTI, 82.30 on PASCAL VOC, and 42.90 on MS-COCO, while the RefineDet-lite detector achieves an mAP of 26.80 on MS-COCO. Comparing performances, FE_YOLO performs best on KITTI as presented in Fig. 8, L4Net performs best on MS-COCO, and the lightweight YOLO-Compact detector outperforms the other detectors on PASCAL VOC.

figure 8

Performance evaluation of deep learning-based lightweight models on leading datasets

4.2 Evaluation parameters

Deep learning-based lightweight object detection models use the same evaluation criteria as generic object detection models. Accuracy is the proportion of objects correctly predicted out of all predictions made. When dealing with class-imbalanced data, where the number of instances differs across classes, accuracy can be quite misleading because it places more emphasis on learning the majority classes than the minority classes. Therefore, mean Average Precision (mAP), Frames Per Second (FPS), and the size of the model weight file serve as the primary evaluation indices for lightweight object detection models. The ground-truth labels of each image provide the precise number of objects of each category in the image. Intersection over Union (IoU) quantifies the similarity between the ground-truth and the predicted bounding box to evaluate how good the predicted bounding box is, as represented in Eq. (1):
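$$\mathrm{IoU} = \frac{\mathrm{area}\,(B_{p} \cap B_{gt})}{\mathrm{area}\,(B_{p} \cup B_{gt})} \tag{1}$$

where \(B_{p}\) denotes the predicted bounding box and \(B_{gt}\) the ground-truth box.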

The IoU value is calculated between each prediction box and the ground-truth data. The largest IoU value is then taken and, based on the IoU threshold, the numbers of True Positives (TP) and False Positives (FP) are calculated for each object category in an image. From these, the Precision of each category is calculated according to Eq. (2):
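$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$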

Once the correct number of TP is obtained, the number of False Negatives (FN) is accounted for through Recall, as in Eq. (3):
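$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$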

By computing various recall rates and the associated precision rates for each category, a PR curve can be plotted for each. Under the evaluation criteria of the PASCAL VOC 2010 object detection competition, the value of AP is identical to the area enclosed by the PR curve. Precision, recall rate, and average precision are thus the three metrics used to assess a model's accuracy on detection tasks. MS-COCO averages mAP over a range of IoU thresholds (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95) with a step of 0.05. The main metric used to judge competitors, called mAP, averages AP over all 80 COCO dataset categories and all 10 thresholds. A higher AP score under the COCO evaluation criteria denotes more precise bounding-box localization of the detected objects. Performance is additionally measured using AP50 and AP75 at the corresponding fixed IoU thresholds, and APs, APm, and APl on objects that are small, medium, and large in size. The primary metric, AP(IoU) = 0.50:0.05:0.95, is determined by averaging over all 10 IoU thresholds across all categories with a uniform step size of 0.05.
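A simplified sketch of this evaluation pipeline, greedy matching of detections to ground truths at one IoU threshold, a PR curve from the cumulative TP/FP counts, and averaging over the ten COCO thresholds, is given below; the official COCO toolkit additionally interpolates precision and handles crowd annotations, so this is an illustration rather than a drop-in replacement:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(dets, gts, thr=0.5):
    """dets: list of (score, box) for one category; gts: list of boxes."""
    matched, tps = set(), []
    for score, box in sorted(dets, key=lambda d: -d[0]):  # high scores first
        best = max((k for k in range(len(gts)) if k not in matched),
                   key=lambda k: iou(box, gts[k]), default=None)
        if best is not None and iou(box, gts[best]) >= thr:
            matched.add(best)      # each ground truth matches at most once
            tps.append(1.0)
        else:
            tps.append(0.0)        # unmatched detection counts as an FP
    tp = np.cumsum(tps)
    fp = np.cumsum(1.0 - np.array(tps))
    recall = np.concatenate(([0.0], tp / max(len(gts), 1)))
    precision = np.concatenate(([1.0], tp / np.maximum(tp + fp, 1e-9)))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

def coco_style_ap(dets, gts):
    """Mean AP over IoU thresholds 0.50:0.05:0.95."""
    return float(np.mean([average_precision(dets, gts, t)
                          for t in np.arange(0.5, 1.0, 0.05)]))
```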

4.3 A summary of edge devices-based platforms for lightweight object detectors

In the upcoming years, an enormous volume of data will be produced by mobile users and IoT devices. This growth will bring new problems such as latency. Additionally, traditional methods cannot be relied upon for long if intelligence is to be derived in real time from deep learning-based object detection and recognition algorithms. Edge computing devices have drawn much interest as a result of prominent firms' efforts to make supercomputing affordable. As the IoT, 5G, and portable processing device eras approach, it is vital to enable developers to swiftly design and deploy edge applications from lightweight detection models. Following advances in deep learning, numerous enhancements to object detection models aimed at edge device applications have been presented. DeepSense, TinyML, DeepThings, and DeepIoT are just a few of the frameworks published in recent years with the intention of compressing deep models for IoT edge devices. To satisfy the processing demands of deep learning-based lightweight object detectors, a model must overcome several constraints, such as a limited battery, high energy consumption, limited computational capability, and constrained memory, while maintaining a level of accuracy. The primary goal should be a framework that allows machine learning models to be quickly implemented on Internet of Things devices. The well-known TinyML frameworks TensorFlow Lite from Google, ELL from Microsoft, ARM-NN and CMSIS-NN from ARM, STM32Cube-AI from STMicroelectronics, and AIfES from Fraunhofer IMS enable the use of deep learning at the periphery. When combined with other microcontroller-based tasks, low-latency, low-power, and low-bandwidth AI algorithms can function as part of an intelligent system at low cost thanks to TinyML on a microcontroller. The DeepIoT framework compresses neural network designs into less dense matrices while preserving the performance of sensing applications by determining how few non-redundant hidden components, such as filters and dimensions, each layer needs. TensorFlow Lite (TFLite) is a fast, lightweight, cross-platform framework for mobile and IoT that scales down massive models. The majority of lightweight models employ TensorFlow Lite quantization, which is easy to deploy on edge devices.
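As a sketch of the typical TFLite workflow (the SavedModel path and file names are illustrative), post-training quantization can be applied during conversion:

```python
import tensorflow as tf

# convert a trained detector exported as a SavedModel (path is illustrative)
converter = tf.lite.TFLiteConverter.from_saved_model("exported_detector")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("detector_quant.tflite", "wb") as f:
    f.write(tflite_model)  # compact flatbuffer ready for an edge runtime
```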

4.3.1 Mobile phones

The limitations imposed by mobile devices may be why less research addresses deploying object detectors on mobile phones than on other embedded platforms. Smartphone complexity and capability are rising quickly, while their size and weight are likely to keep decreasing. Few literature studies have attempted implementations on smartphone-based devices (Lan et al. 2019; Liu et al. 2020a, 2020b, 2020c, 2020d; Liu et al. 2021a, 2021b, 2021c; Li et al. 2021a, 2021b, 2021c; Paluru et al. 2021). This places a heavy burden on creating models that are small, light, and require a minimal number of computations. It is advised to test novel ideas for deep learning inference optimization on portable models that are regularly used with cellphones (Xu et al. 2019). Either the spatial or the temporal complexity of deep learning models can be reduced to the point where they can be fully implemented on mobile devices, but many security issues may remain to be fixed (Steimle et al. 2017). Although deep learning for smartphone object detection appears to be a promising field of study, success will need many more contributions (Wang et al. 2022a, 2022b, 2022c, 2022d).

4.3.2 IoT edge devices

One way to enable deep learning on IoT edge devices is to transfer model inference to a cloud server. Another way to boost the power of these inexpensive devices is to add an accelerator, although the price of such accelerators is a major drawback. Some edge devices, like the Raspberry Pi, may require an extra accelerator, whereas others, like the Coral Dev Board, already have edge TPU accelerators built in. Deep learning can also be enabled to run locally or remotely using a distributed design that links computationally weak front-end devices with more potent back-end devices, such as a cloud server or accelerator (Ran et al. 2017).

4.3.3 Embedded boards

To provide the finest design options, processor-FPGA combinations and FPGAs with hard processor cores embedded in their fabric are widely used. Lattice Semiconductor, Xilinx, Microchip, and Altera (Intel) are the well-known manufacturers. The literature suggests that the Xilinx board family is the one most frequently utilized for deep learning-based applications. An additional accelerator is often needed when employing FPGA devices to obtain acceptable performance (Saidi et al. 2021). Thanks to Integrated Development Environment (IDE) and high-level language support, the Arduino and Spark-based boards at the top of the device family allow for greater software-level programming (Kondaveeti et al. 2021).

4.4 Applications specific to deep learning-based lightweight object detectors

In the above sections, we discussed the architectural details and leading datasets of deep learning-based lightweight object detection models. These models offer a multitude of applications, such as remote sensing (Xu and Wu 2021; Ma et al. 2023) and aerial images (Xu and Wu 2021; Zhou et al. 2022), traffic monitoring (Jiang et al. 2023; Zheng et al. 2023), fire detection (Chen et al. 2023), indoor robots (Jiang et al. 2022), and pedestrian detection (Jian et al. 2023). A summary of literature findings supporting the applications of deep learning-based lightweight object detection models is listed in Table 6. In (Zhou et al. 2019), a lightweight detection network called YOLO-RD was proposed for Range Doppler (RD) radar pictures, and a brand-new lightweight mini-RD dataset was created for efficient network training. On the mini-RD dataset, YOLO-RD produced effective results with a smaller memory budget and a detection accuracy of 97.54%. Addressing both the algorithm and hardware-resource aspects of object detection, (Ding et al. 2019) introduced REQ-YOLO, a resource-aware, systematic weight quantization framework for object detection. It applied the block-circulant matrix approach to non-convex optimization problems on FPGAs and proposed heterogeneous weight quantization. The outcomes demonstrated that the REQ-YOLO framework can greatly reduce the size of the YOLO model with only a slight reduction in accuracy. For autonomous vehicles, the L4Net of (Wu et al. 2021), which locates object proposals, integrates a keypoint detection backbone with a co-attention strategy, attaining cheaper computation costs with improved detection accuracy under a variety of resource constraints. To generate more precise prediction boxes, the backbone captures context-wise information, and the co-attention scheme combines the strengths of both class-agnostic and semantic attention. With a 13.7 M model size and speeds of 149 FPS on an NVIDIA TX and 30.7 FPS on a Qualcomm-based device, L4Net achieved 71.68% mAP. The huge data processing load and resource-constrained scenarios on GPUs make the rapid development of effective object detectors for CPU-only hardware necessary. With three orthogonal training strategies, an IoU-guided loss, a classes-aware weighting method, and a balanced multi-task training approach, (Chen et al. 2020a, 2020b) proposed a lightweight backbone and a light-head detection component. On a single-thread CPU, the suggested RefineDetLite obtained 26.8 mAP at a pace of 130 ms/pic. LiraNet, a compact CNN, was suggested by (Long et al. 2020a, 2020b) for the recognition of marine ship objects in radar pictures. LiraNet was mounted on the existing detection framework Darknet to create Lira-YOLO, a compact model that is simple to deploy on mobile devices. Additionally, a lightweight dataset of distant Doppler-domain radar pictures known as mini-RD was built to test the performance of the proposed model. Studies reveal that Lira-YOLO's network complexity is low, at 2.980 Bflops, and its parameter quantity is small, at 4.3 MB, with a high detection accuracy of 83.21%. (Lu et al. 2020) developed an efficient YOLO-compact network for real-time object detection in the single-person category. The down-sampling layer was separated in this network, which facilitated the modular design by enhancing the residual bottleneck block.
YOLO-compact's AP result is 86.85% and its model size is 9 MB, making it smaller than tiny-yolov3, tiny-yolov2, and YOLOv3. Focusing on small targets and background complexity, (Xu and Wu 2021) presented FE-YOLO for deep learning-based target detection in remote-sensing photos; analyses on remote-sensing datasets demonstrate that FE-YOLO outperformed existing cutting-edge target detection methods. A brand-new YOLOv4-dense model was put forth by (Jiang et al. 2023) for real-time object recognition on edge devices, with a dense block devised to address the loss of small objects and further reduce computational complexity. With 20.3 M parameters, YOLOv4-dense obtained 84.3% mAP at 22.6 FPS. To improve the detection of small and medium-sized objects in aerial photos, (Zhou et al. 2022) developed the Dense Feature Fusion Path Aggregation Network (DFF-PANet); trials conducted on the HRSC2016 and DOTA datasets yielded 71.5% mAP with a lightweight 9.2 M model. To help an indoor mobile robot solve the problem of object detection and recognition, (Jiang et al. 2022) presented ShuffleNet-SSD, built from deep separable convolution, point-by-point grouping convolution, and channel rearrangement, and created a dataset for the mobile robot in indoor scenes. For the detection of dead trees, whose timely replacement keeps the ecosystem stable and resilient to catastrophic disasters, (Wang et al. 2022a, 2022b, 2022c, 2022d) suggested a novel lightweight architecture called LDS-YOLO based on the YOLO framework. With the addition of the SoftPool approach in Spatial Pyramid Pooling (SPP), a unique feature extraction module is provided that reuses features from earlier layers to ensure that small targets are not ignored. The suggested approach was assessed on UAV-captured photos, and the experimental findings show that the LDS-YOLO architecture works well compared with the state of the art, with an AP of 89.11% and a parameter size of 7.6 MB. Table 8 categorizes several applications of lightweight object detectors with respect to image type, such as remote-sensing, aerial, medical, and video streams, and application type, such as healthcare, medical, military, and industrial use.

4.5 Discussion and contributions

According to the above analysis of deep learning-based lightweight object detectors, there is a need to focus on developing detectors for edge devices that strike a good balance between speed and accuracy. Furthermore, real-time deployment of these detectors on edge devices is needed while maintaining the accuracy of lightweight detectors without compromising precision. In 2022, the lightweight backbone architectures ShuffleNet and SqueezeNet had the most publications with respect to lightweight object detectors. In 2023, the transformer-based MobileViT started attracting researchers' attention, achieving a top-1 accuracy of 78.4, and MobileNet backbone architectures were the most employed compared with others. As shown in Table 8, video streams are the most common input type for deep learning-based lightweight object detectors. Across diverse applications, traffic- and pedestrian-related detection problems, obstacles, and driving assistance have the most studies, whereas all other existing applications have few lightweight detectors on edge devices. As we witnessed, the majority of the presented lightweight models are from the YOLO family, with deeper networks and increasing numbers of parameters accounting for the improved accuracy. Therefore, the most important question when a model migrates from a cloud device to an edge device is how to reduce the parameters of a deep learning-based lightweight model. The numerous approaches being used to address this are described in the next section.

4.6 Recommendations for designing powerful deep learning-based lightweight models

Researchers have created new training methods that decrease the memory footprint on the edge device and speed up training on low-resource devices, in addition to specialized hardware for training deep learning models at the network edge. The techniques discussed in this section, pruning, quantization, knowledge distillation, and low-rank decomposition, are the four key categories used to compress pre-trained networks (Kamath and Renuka 2023) and are listed in the following (Koubaa et al. 2021; Makkar et al. 2021; Wang et al. 2020a, 2020b, 2020c):

4.6.1 Pruning

Network pruning is a useful technique for reducing the size of an object detection model and speeding up model inference. By cutting connections between neurons that are irrelevant to the application, this method lowers the number of computations needed to analyse fresh input. In addition to eliminating connections, it can also eliminate neurons deemed irrelevant when most of their weights are low relative to the deep neural network's overall context. With this method, a deep neural network with reduced size, greater speed, and improved memory efficiency can be deployed on low-resource devices, such as edge devices.
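A minimal sketch of magnitude-based pruning using PyTorch's built-in utilities (the layer and the 30% ratio are illustrative) is:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(32, 64, 3)
# zero out the 30% of weights with the smallest L1 magnitude
prune.l1_unstructured(conv, name="weight", amount=0.3)
prune.remove(conv, "weight")  # bake the mask into the tensor permanently

sparsity = (conv.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")
```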

4.6.2 Weights quantization

The weight quantization approach, which trades precision for speed, shrinks the model's storage requirements by reducing the precision of its floating-point parameters. Rather than keeping every weight as a separate full-precision value, the weight quantization technique compresses the weights to integers or numbers that occupy as few bits as possible by clustering similar weight values into a single value. Consequently, the weights are re-adjusted, meaning the precision is modified as well; this results in a cyclical implementation where the weights are quantized after each training round.
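As an illustration of the idea, PyTorch's post-training dynamic quantization stores weights as 8-bit integers and rescales activations on the fly; the toy classification head below is our own example:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 80))

# weights of the listed module types are converted to int8;
# activations are quantized dynamically at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 512))  # inference runs on the compressed model
```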

4.6.3 Knowledge distillation

Knowledge distillation presents itself as a new mode of transfer learning. This technique extracts knowledge from a big, well-trained deep neural network, dubbed the teacher, into a smaller deep network, called the student. In this way, the student network learns to achieve the same outcomes as the teacher network while shrinking in size and increasing in processing speed. Through the process of knowledge distillation, information is transferred from a large, thoroughly trained end-to-end detection network to several faster sub-models.
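A common way to realize this teacher-student transfer for the classification branch is a Hinton-style soft-target loss; the sketch below is a generic illustration (the temperature and blending weight are typical but arbitrary choices), not the distillation scheme of any specific detector surveyed here:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a softened KL term against the teacher with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean") * (T * T)   # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```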

4.6.4 Training tiny networks

In the low-rank decomposition method, the deep neural network's initial convolution kernels are mostly broken down using matrix decomposition, although this noticeably affects the accuracy of the results. Directly training tiny networks, by contrast, can drastically reduce the loss of network accuracy and speed up inference.

4.6.5 Federated learning and model partition

Distributed learning or federated learning are two possible training approaches for dealing with complicated tasks or training rounds involving a lot of data. The data are broken into smaller groups that are distributed among the nodes of the edge network. Each node trains on the data it receives as part of the final deep neural network, enabling active learning capabilities at the network edge. Model partitioning is a strategy that may be applied in the inference phase using the same methodology: to divide the burden, a separate node computes each layer of the deep neural network in a model split. This approach also makes scaling simple.
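The aggregation step such distributed training relies on can be sketched as a FedAvg-style average of the locally trained copies of one model; the fedavg helper below is our own illustration, not a specific framework's API:

```python
import copy
import torch

def fedavg(client_state_dicts):
    """Average each parameter tensor across the clients' model copies."""
    avg = copy.deepcopy(client_state_dicts[0])
    for key in avg:
        avg[key] = torch.stack(
            [sd[key].float() for sd in client_state_dicts]).mean(dim=0)
    return avg

# usage: global_model.load_state_dict(fedavg([m.state_dict() for m in clients]))
```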

Moreover, to boost the flow of information within a constrained time budget, multi-scale feature learning comprising single feature maps, pyramidal feature hierarchies, and integrated features may be used in lightweight detection models. Feature pyramid networks and their variations, such as feature fusion, feature pyramid generation, and multi-scale fusion modules, help overcome object detection difficulties. Additionally, to boost the effectiveness of lightweight object detection models, researchers are working to advance activation functions and normalization in various applications. The above-mentioned techniques accelerate the adoption of deep learning models on edge devices. Deep learning-based lightweight object detection models have not yet achieved results comparable to generic object detection, and to mitigate these differences, powerful and innovative lightweight detectors must be designed. Some recommendations for designing powerful lightweight deep learning-based detectors are given below.

Incorporation of FPNs - A bidirectional FPN can be utilized to improve semantic information while incorporating feature fusion operations (Wang et al. 2023a, 2023b). To collect bottom-up and top-down features more successfully than FPN, an effective feature-preserving and refining module can be introduced (Tang et al. 2020a, 2020b). Deep learning-based lightweight detectors can be designed with cross-layer connections and the extraction of features at various scales while using depth-wise separable convolution. A multi-scale FPN architecture with a lightweight backbone can be exploited to extract features from the input image.

Transformer-based Solutions - To increase the precision of transformer-based lightweight detectors, group normalisation can be implemented in the encoder-and-decoder module and the h-sigmoid activation function in the multi-layer perceptron (Li, Wang and Zhang 2022).

Receptive Fields Enlargement - A multi-branch block involving various receptive fields improves both the expressiveness of single-scale features and detection on a single scale. Multiple network branches may increase network width and slightly enhance performance (Liu et al. 2022).

Feature Fusion Operation - To combine several feature maps of the backbone and assemble multi-scale features into a feature pyramid, the fusion operation offers a concatenation model (Mao et al. 2019). To improve the extraction of information in a lightweight model, the weights of the feature maps' various channels can be reassigned. Furthermore, integrating an attention module with a data augmentation technique may improve performance (Li et al. 2022a, 2022b). Implementing an FPN in the lightweight detector architecture enables the smooth fusion of semantic information from a low-resolution scale to the neighbouring high-resolution scale (Li et al. 2018).

Effect of Depth-wise Separable Convolution - The optimal design principle for lightweight object detection models consists of fewer channels with more convolutional layers (Kim et al. 2016). Researchers can concentrate on network scaling approaches that adjust width, resolution, and network structure to reduce or balance the size of the feature maps, keep the number of channels constant after convolution, and minimize convolutional input and output (Wang et al. 2021b). The typical convolution in the network structure can be replaced with an over-parameterized depth-wise convolutional layer, which significantly reduces computation and boosts network performance. To increase numerical resolution, ReLU6 can be used in place of the Leaky ReLU activation function (Ding et al. 2022).

Increase in Semantic Information - To keep semantic features in the high-level feature maps of a deep lightweight detection network, smaller cross-stage partial SPPs and RFBs facilitate the integration of high-level semantic information with low-level feature maps (Wang et al. 2022a, 2022b, 2022c, 2022d). The architectural additions of context enhancement and spatial attention modules can be employed to generate more discriminative feature representations (Qin et al. 2019).

Pruning Strategy - Block-punched pruning uses a fine-grained structured pruning method to maximise structural flexibility and minimise accuracy loss. High hardware parallelism can be achieved using the block-punched pruning strategy if the block size is suitable and compiler-level code generation is used (Cai et al. 2021 ).

Assignment Strategy - To improve the training of deep learning-based lightweight object detectors, the SimOTA dynamic label assignment method can be used. When creating lightweight detection models, the combination of FCOS-based regression, dynamic and learnable sample assignment, and varifocal loss for handling class imbalance works better (Yu et al. 2021). Designing lightweight object detectors with the anchor-free approach has been successful when combined with other cutting-edge detection methods using decoupled heads and the top label assignment strategy SimOTA (Ge et al. 2021).

There are two ways of deploying deep learning-based lightweight models on edge devices. In the first, a lightweight model or compressed data is employed to match the compute capabilities of the limited edge device; this is the case for on-board object detection, and the compromise between compression ratio and detection accuracy is its drawback. In the second, the model is distributed and data are exchanged, with computations spread over several devices and a cloud server able to handle part of them; here, privacy and security are the primary issues (Zhang et al. 2020a, 2020b, 2020c). Device coordination must be established carefully in this scenario, as the collaborative learning algorithm may introduce extra overhead and overwork the edge devices. Whatever the plan, all of these deployment methods rely on edge devices and must deal with the problems edge devices present. The primary causes of difficulty are data disparity in real-world scenarios and the need to manage real-time sensor data while performing numerous deep learning tasks. The requirement for powerful processing units, the high computing demands of deep learning models, and short battery life make validating lightweight models tough. In the future, we will strive to create such standards-compliant lightweight detection deployment models.

5 Conclusion

This study asserted that deep learning-based lightweight object detection models are a good candidate for improving the hardware efficiency of neural network architectures. This survey has examined and presented the most recent lightweight models for edge devices. The backbone architectures commonly utilized in deep learning-based lightweight object detection methods have also been described, among which ShuffleNet and MobileNetV2 are employed the most. Critical aspects of the current state-of-the-art deep learning-based lightweight object detection models on edge devices have been discussed, and the emerging lightweight object detection models have been compared on the basis of COCO-based mAP scores. A summary of heterogeneous applications for lightweight object detection models, considering diverse image types and application categories, has been presented, together with information on edge platforms for deploying portable detector models. A few recommendations are also given for creating powerful deep learning-based lightweight models, including multi-scale and multi-branch FPNs, federated learning, partitioning strategies, pruning, knowledge distillation, and label assignment algorithms. Although lightweight detectors have demonstrated significant potential by approaching the classification errors of the thorough models, they still fall more than 50% short of delivering such outcomes.

Abou El Houda Z, Brik B, Ksentini A, Khoukhi L (2023) A MEC-based architecture to secure IOT applications using federated deep learning. IEEE Internet Things Mag 6(1):60–63

Agarwal S, Terrail JOD, Jurie F (2018) Recent advances in object detection in the age of deep convolutional neural networks. arXiv preprint arXiv:1809.03193

Alfasly S, Liu B, Hu Y, Wang Y, Li CT (2019) Auto-zooming CNN-based framework for real-time pedestrian detection in outdoor surveillance videos. IEEE Access 7:105816–105826

Bai X, Zhou J (2020) Efficient semantic segmentation using multi-path decoder. Appl Sci 10(18):6386

Betti A, Tucci M (2023) YOLO-S: a lightweight and accurate YOLO-like network for small target detection in aerial imagery. Sensors 23(4):1865

Bochkovskiy A, Wang CY, Liao HYM (2020) Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934

Brunetti A, Buongiorno D, Trotta GF, Bevilacqua V (2018) Computer vision and deep learning techniques for pedestrian detection and tracking: a survey. Neurocomputing 300:17–33

Cai Z, Vasconcelos N (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154–6162)

Cai H, Gan C, Wang T, Zhang Z, Han S (2019) Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791

Cai Y, Li H, Yuan G, Niu W, Li Y, Tang X, Ren B, Wang Y (2021) Yolobile: real-time object detection on mobile devices via compression-compilation co-design. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 2, pp. 955–963)

Cao J, Bao W, Shang H, Yuan M, Cheng Q (2023) GCL-YOLO: a GhostConv-based lightweight yolo network for UAV small object detection. Remote Sens 15(20):4932

Chabas JM, Chandra G, Sanchi G, Mitra M (2018) New demand, new markets: What edge computing means for hardware companies. https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/new-demand-new-markets-what-edge-computing-means-for-hardware-companies

Chang L, Zhang S, Du H, You Z, Wang S (2021) Position-aware lightweight object detectors with depthwise separable convolutions. J Real-Time Image Proc 18:857–871

Chen Y, Yang T, Zhang X, Meng G, Xiao X, Sun J (2019) Detnas: backbone search for object detection. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.1903.10979

Chen L, Ding Q, Zou Q, Chen Z, Li L (2020b) DenseLightNet: a light-weight vehicle detection network for autonomous driving. IEEE Trans Industr Electron 67(12):10600–10609

Chen C, Yu J, Lin Y, Lai F, Zheng G, Lin Y (2023) Fire detection based on improved PP-YOLO. SIViP 17(4):1061–1067

Chen C, Liu M, Meng X, Xiao W, Ju Q (2020) Refinedetlite: a lightweight one-stage object detection framework for cpu-only devices. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 700–701)

Cheng Y, Li G, Wong N, Chen HB, Yu H (2020) DEEPEYE: a deeply tensor-compressed neural network for video comprehension on terminal devices. ACM Trans Embed Comput Syst (TECS) 19(3):1–25

Cho C, Choi W, Kim T (2020) Leveraging uncertainties in Softmax decision-making models for low-power IoT devices. Sensors 20(16):4603

Cui B, Dong XM, Zhan Q, Peng J, Sun W (2021) LiteDepthwiseNet: a lightweight network for hyperspectral image classification. IEEE Trans Geosci Remote Sens 60:1–15

Cui M, Gong G, Chen G, Wang H, Jin M, Mao W, Lu H (2023) LC-YOLO: a lightweight model with efficient utilization of limited detail features for small object detection. Appl Sci 13(5):3174

Dai Y, Liu W (2023) GL-YOLO-Lite: a novel lightweight fallen person detection model. Entropy 25(4):587

Dai W, Li D, Tang D, Jiang Q, Wang D, Wang H, Peng Y (2021) Deep learning assisted vision inspection of resistance spot welds. J Manuf Process 62:262–274

Dai J, Li Y, He K, Sun J (2016) R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, 29

Detector AFO (2022) Fcos: a simple and strong anchor-free object detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4)

Dey S, Mukherjee A (2018) Implementing deep learning and inferencing on fog and edge computing systems. In 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops) (pp. 818–823). IEEE

Ding P, Qian H, Chu S (2022) Slimyolov4: lightweight object detector based on yolov4. J Real-Time Image Proc 19(3):487–498

Ding C, Wang S, Liu N, Xu K, Wang Y, Liang Y (2019) REQ-YOLO: a resource-aware, efficient quantization framework for object detection on FPGAs. In proceedings of the 2019 ACM/SIGDA international symposium on field-programmable gate arrays (pp. 33–42)

Drolia U, Guo K, Narasimhan P (2017) Precog: prefetching for image recognition applications at the edge. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing (pp. 1–13)

Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6569–6578)

Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (VOC) challenge. Int J Comput vis 88:303–338

Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 15(1):3133–3181

MathSciNet   Google Scholar  

Gadosey PK, Li Y, Agyekum EA, Zhang T, Liu Z, Yamak PT, Essaf F (2020) SD-UNET: stripping down U-net for segmentation of biomedical images on platforms with low computational budgets. Diagnostics 10(2):110

Gagliardi A, de Gioia F, Saponara S (2021) A real-time video smoke detection algorithm based on Kalman filter and CNN. J Real-Time Image Proc 18(6):2085–2095

Ge Z, Liu S, Wang F, Li Z, Sun J (2021) Yolox: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430

Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: the kitti dataset. Int J Robot Res 32(11):1231–1237

Girshick R (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1448)

Guo W, Li W, Li Z, Gong W, Cui J, Wang X (2020) A slimmer network with polymorphic and group attention modules for more efficient object detection in aerial images. Remote Sens 12(22):3750

Han J, Zhang D, Cheng G, Liu N, Xu D (2018) Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Process Mag 35(1):84–100

Han S, Yoo J, Kwon S (2019) Real-time vehicle-detection method in bird-view unmanned-aerial-vehicle imagery. Sensors 19(18):3958

Han S, Liu X, Han X, Wang G, Wu S (2020b) Visual sorting of express parcels based on multi-task deep learning. Sensors 20(23):6785

Han K, Wang Y, Tian Q, Guo J, Xu C, Xu C (2020) Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1580–1589)

Haque WA, Arefin S, Shihavuddin ASM, Hasan MA (2021) DeepThin: a novel lightweight CNN architecture for traffic sign recognition without GPU requirements. Expert Syst Appl 168:114481

He W, Huang Y, Fu Z, Lin Y (2020) Iconet: a lightweight network with greater environmental adaptivity. Symmetry 12(12):2119

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)

He K, Gkioxari G, Dollár P, and Girshick R (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969)

Hou Y, Li Q, Han Q, Peng B, Wang L, Gu X, Wang D (2021) MobileCrack: object classification in asphalt pavements using an adaptive lightweight deep learning. J Trans Eng Part b: Pavements 147(1):04020092

Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

Hu X, Yang W, Wen H, Liu Y, Peng Y (2021) A lightweight 1-D convolution augmented transformer with metric learning for hyperspectral image classification. Sensors 21(5):1751

Hu M, Li Z, Yu J, Wan X, Tan H, Lin Z (2023b) Efficient-lightweight yolo: improving small object detection in yolo for aerial images. Sensors 23(14):6423

Hu B, Wang Y, Cheng J, Zhao T, Xie Y, Guo X, Chen Y (2023) Secure and efficient mobile DNN using trusted execution environments. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security (pp. 274–285)

Hua H, Li Y, Wang T, Dong N, Li W, Cao J (2023) Edge computing with artificial intelligence: a machine learning perspective. ACM Comput Surv 55(9):1–35

Huang Z, Yang S, Zhou M, Gong Z, Abusorrah A, Lin C, Huang Z (2022) Making accurate object detection at the edge: review and new approach. Artif Intell Rev 55(3):2245–2274

Huang L, Yang Y, Deng Y, Yu Y (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874

Huang R, Pedoeem J, Chen C (2018) YOLO-LITE: a real-time object detection algorithm optimized for non-GPU computers. In 2018 IEEE international conference on big data (big data) (pp. 2503–2510). IEEE

Huang X, Wang X, Lv W, Bai X, Long X, Deng K, Dang Q, Han S, Liu Q, Hu X, Yu D (2021) PP-YOLOv2: a practical object detector. arXiv preprint arXiv:2104.10419

Huyan L, Bai Y, Li Y, Jiang D, Zhang Y, Zhou Q, Wei J, Liu J, Zhang Y, Cui T (2021) A lightweight object detection framework for remote sensing images. Remote Sens 13(4):683

Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size. arXiv preprint arXiv:1602.07360

Isereau D, Capraro C, Cote E, Barnell M, Raymond C (2017) Utilizing high-performance embedded computing, agile condor, for intelligent processing: An artificial intelligence platform for remotely piloted aircraft. In 2017 Intelligent Systems Conference (IntelliSys) (pp. 1155–1159). IEEE

Jain DK, Zhao X, González-Almagro G, Gan C, Kotecha K (2023) Multimodal pedestrian detection using metaheuristics with deep convolutional neural network in crowded scenes. Inf Fus 95:401–414

Jeong M, Park M, Nam J, Ko BC (2020) Light-weight student LSTM for real-time wildfire smoke detection. Sensors 20(19):5508

Jiang S, Li H, Jin Z (2021) A visually interpretable deep learning framework for histopathological image-based skin cancer diagnosis. IEEE J Biomed Health Inform 25(5):1483–1494

Jiang L, Nie W, Zhu J, Gao X, Lei B (2022) Lightweight object detection network model suitable for indoor mobile robots. J Mech Sci Technol 36(2):907–920

Jiang Y, Li W, Zhang J, Li F, Wu Z (2023) YOLOv4-dense: a smaller and faster YOLOv4 for real-time edge-device based object detection in traffic scene. IET Image Proc 17(2):570–580

Jiang Z, Zhao L, Li S, Jia Y (2020) Real-time object detection method based on improved YOLOv4-tiny. arXiv preprint arXiv:2011.04244

Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R (2019) A survey of deep learning-based object detection. IEEE Access 7:128837–128868

Jin R, Lin D (2019) Adaptive anchor for fast object detection in aerial image. IEEE Geosci Remote Sens Lett 17(5):839–843

Jin Y, Cai J, Xu J, Huan Y, Yan Y, Huang B, Guo Y, Zheng L, Zou Z (2021) Self-aware distributed deep learning framework for heterogeneous IoT edge devices. Futur Gener Comput Syst 125:908–920

Kamal KC, Yin Z, Wu M, Wu Z (2019) Depthwise separable convolution architectures for plant disease classification. Comput Electron Agric 165:104948

Kamath V, Renuka A (2023) Deep learning based object detection for resource constrained devices: systematic review, future trends and challenges ahead. Neurocomputing 531:34–60

Kang H, Zhou H, Wang X, Chen C (2020) Real-time fruit recognition and grasping estimation for robotic apple harvesting. Sensors 20(19):5670

Ke X, Lin X, Qin L (2021) Lightweight convolutional neural network-based pedestrian detection and re-identification in multiple scenarios. Mach vis Appl 32:1–23

Kim W, Jung WS, Choi HK (2019) Lightweight driver monitoring system based on multi-task mobilenets. Sensors 19(14):3200

Kim K, Jang SJ, Park J, Lee E, Lee SS (2023) Lightweight and energy-efficient deep learning accelerator for real-time object detection on edge devices. Sensors 23(3):1185

Kim KH, Hong S, Roh B, Cheon Y, and Park M (2016) Pvanet: Deep but lightweight neural networks for real-time object detection. arXiv preprint arXiv:1608.08021 .

Kondaveeti HK, Kumaravelu NK, Vanambathina SD, Mathe SE, Vappangi S (2021) A systematic literature review on prototyping with Arduino: applications, challenges, advantages, and limitations. Comput Sci Rev 40:100364

Kong T, Sun F, Liu H, Jiang Y, Li L, Shi J (2020a) Foveabox: beyound anchor-based object detection. IEEE Trans Image Process 29:7389–7398

Kong Z, Xiong F, Zhang C, Fu Z, Zhang M, Weng J, Fan M (2020b) Automated maxillofacial segmentation in panoramic dental X-ray images using an efficient encoder-decoder network. IEEE Access 8:207822–207833

Koubaa A, Ammar A, Kanhouch A, AlHabashi Y (2021) Cloud versus edge deployment strategies of real-time face recognition inference. IEEE Trans Netw Sci Eng 9(1):143–160

Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 25

Kyrkou C (2020) YOLOpeds: efficient real-time single-shot pedestrian detection for smart camera applications. IET Comput Vision 14(7):417–425

Kyrkou C (2021) C 3 Net: end-to-end deep learning for efficient real-time visual active camera control. J Real-Time Image Proc 18(4):1421–1433

Kyrkou C, Theocharides T (2020) EmergencyNet: efficient aerial image classification for drone-based emergency monitoring using atrous convolutional feature fusion. IEEE J Sel Top Appl Earth Observ Remote Sens 13:1687–1699

Lai CY, Wu BX, Shivanna VM, Guo JI (2021) MTSAN: multi-task semantic attention network for ADAS applications. IEEE Access 9:50700–50714

Lan H, Meng J, Hundt C, Schmidt B, Deng M, Wang X, Liu W, Qiao Y, Feng S (2019) FeatherCNN: fast inference computation with TensorGEMM on ARM architectures. IEEE Trans Parallel Distrib Syst 31(3):580–594

Law H, Deng J (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV) (pp. 734–750)

Law H, Teng Y, Russakovsky O, Deng J (2019) Cornernet-lite: efficient keypoint based object detection. arXiv preprint arXiv:1904.08900

Li J, Ye J (2023) Edge-YOLO: lightweight infrared object detection method deployed on edge devices. Appl Sci 13(7):4402

Li X, Wang W, Wu L, Chen S, Hu X, Li J, Tang J, Yang J (2020a) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. Adv Neural Inf Process Syst 33:21002–21012

Li P, Han L, Tao X, Zhang X, Grecos C, Plaza A, Ren P (2020b) Hashing nets for hashing: a quantized deep learning to hash framework for remote sensing image retrieval. IEEE Trans Geosci Remote Sens 58(10):7331–7345

Li Y, Li M, Qi J, Zhou D, Zou Z, Liu K (2021a) Detection of typical obstacles in orchards based on deep convolutional neural network. Comput Electron Agric 181:105932

Li Z, Liu X, Zhao Y, Liu B, Huang Z, Hong R (2021b) A lightweight multi-scale aggregated model for detecting aerial images captured by UAVs. J vis Commun Image Represent 77:103058

Li C, Fan Y, Cai X (2021c) PyConvU-Net: a lightweight and multiscale network for biomedical image segmentation. BMC Bioinf 22:1–11

Li T, Wang J, Zhang T (2022a) L-DETR: a light-weight detector for end-to-end object detection with transformers. IEEE Access 10:105685–105692

Li S, Yang Z, Nie H, Chen X (2022b) Corn disease detection based on an improved YOLOX-Tiny network model. Int J Cognit Inform Nat Intell (IJCINI) 16(1):1–8

Li H, Lin Z, Shen X, Brandt J, Hua G (2015) A convolutional neural network cascade for face detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5325–5334)

Li Z, Peng C, Yu G, Zhang X, Deng Y, Sun J (2017) Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264

Li Y, Li J, Lin W, Li J (2018) Tiny-DSOD: lightweight object detection for resource-restricted usages. arXiv preprint arXiv:1807.11013

Li Y, Chen Y, Wang N, Zhang Z (2019) Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6054–6063)

Liang L, Wang G (2021) Efficient recurrent attention network for remote sensing scene classification. IET Image Proc 15(8):1712–1721

Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13 (pp. 740–755). Springer International Publishing

Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988)

Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125)

Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M (2020a) Deep learning for generic object detection: a survey. Int J Comput Vision 128:261–318

Liu X, Liu B, Liu G, Chen F, Xing T (2020b) Mobileaid: a fast and effective cognitive aid system on mobile devices. IEEE Access 8:101923–101933

Liu J, Li Q, Cao R, Tang W, Qiu G (2020c) MiniNet: an extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation. ISPRS J Photogramm Remote Sens 166:255–267

Liu X, Li Y, Shuang F, Gao F, Zhou X, Chen X (2020d) ISSD: improved SSD for insulator and spacer online detection based on UAV system. Sensors 20(23):6961

Liu Y, Sun P, Wergeles N, Shang Y (2021a) A survey and performance evaluation of deep learning methods for small object detection. Expert Syst Appl 172:114602

Liu S, Guo B, Ma K, Yu Z, Du J (2021b) AdaSpring: context-adaptive and runtime-evolutionary deep model compression for mobile applications. Proc ACM Interact Mobile Wearable Ubiquitous Technol 5(1):1–22

Liu Z, Ma J, Weng J, Huang F, Wu Y, Wei L, Li Y (2021c) LPPTE: a lightweight privacy-preserving trust evaluation scheme for facilitating distributed data fusion in cooperative vehicular safety applications. Inf Fus 73:144–156

Liu Y, Zhang C, Wu W, Zhang B, Zhou F (2022a) MiniYOLO: a lightweight object detection algorithm that realizes the trade-off between model size and detection accuracy. Int J Intell Syst 37(12):12135–12151

Liu T, Wang J, Huang X, Lu Y, Bao J (2022b) 3DSMDA-Net: an improved 3DCNN with separable structure and multi-dimensional attention for welding status recognition. J Manuf Syst 62:811–822

Liu S, Huang D (2018) Receptive field block net for accurate and fast object detection. In Proceedings of the European conference on computer vision (ECCV) (pp. 385–400)

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14 (pp. 21–37). Springer International Publishing

Long F (2020) Microscopy cell nuclei segmentation with enhanced U-Net. BMC Bioinf 21(1):8

Long ZHOU, Suyuan W, Zhongma CUI, Jiaqi FANG, Xiaoting YANG, Wei D (2020b) Lira-YOLO: a lightweight model for ship detection in radar images. J Syst Eng Electron 31(5):950–956

Long X, Deng K, Wang G, Zhang Y, Dang Q, Gao Y, Shen H, Ren J, Han S, Ding E, Wen S (2020) PP-YOLO: An effective and efficient implementation of object detector. arXiv preprint arXiv:2007.12099

Lu Y, Zhang L, and Xie W (2020) YOLO-compact: an efficient YOLO network for single category real-time object detection. In 2020 Chinese control and decision conference (CCDC) (pp. 1931–1936). IEEE

Luo X, Zhu J, Yu Q (2019) Efficient convNets for fast traffic sign recognition. IET Intel Transport Syst 13(6):1011–1015

Ma N, Yu X, Peng Y, Wang S (2019) A lightweight hyperspectral image anomaly detector for real-time mission. Remote Sens 11(13):1622

Ma M, Ma W, Jiao L, Liu X, Li L, Feng Z, Yang S (2023) A multimodal hyper-fusion transformer for remote sensing image classification. Inf Fus 96:66–79

Ma N, Zhang X, Zheng HT, Sun J (2018) Shufflenet v2: practical guidelines for efficient CNN architecture design. In Proceedings of the European conference on computer vision (ECCV) (pp. 116–131)

Makantasis K, Karantzalos K, Doulamis A, Doulamis N (2015) Deep supervised learning for hyperspectral data classification through convolutional neural networks. In 2015 IEEE international geoscience and remote sensing symposium (IGARSS) (pp. 4959–4962). IEEE

Makkar A, Ghosh U, Rawat DB, Abawajy JH (2021) Fedlearnsp: preserving privacy and security using federated learning and edge computing. IEEE Consumer Electron Mag 11(2):21–27

Mansouri SS, Kanellakis C, Kominiak D, Nikolakopoulos G (2020) Deploying MAVs for autonomous navigation in dark underground mine environments. Robot Auton Syst 126:103472

Mao QC, Sun HM, Liu YB, Jia RS (2019) Mini-YOLOv3: real-time object detector for embedded applications. IEEE Access 7:133529–133538

Mehta S, Rastegari M (2021) Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178

Mittal P, Singh R, Sharma A (2020) Deep learning-based object detection in low-altitude UAV datasets: a survey. Image vis Comput 104:104046

Muhammad K, Hussain T, Del Ser J, Palade V, De Albuquerque VHC (2019) DeepReS: a deep learning-based video summarization strategy for resource-constrained industrial surveillance scenarios. IEEE Trans Industr Inf 16(9):5938–5947

Nguyen HD, Na IS, Kim SH, Lee GS, Yang HJ, Choi JH (2019) Multiple human tracking in drone image. Multimedia Tools Appl 78:4563–4577

Nguyen TV, Tran AT, Dao NN, Moon H, Cho S (2023) Information fusion on delivery: a survey on the roles of mobile edge caching systems. Inf Fus 89:486–509

Ogden SS, Guo T (2019) Characterizing the deep neural networks inference performance of mobile applications. arXiv preprint arXiv:1909.04783

Ophoff T, Van Beeck K, Goedemé T (2019) Exploring RGB+ Depth fusion for real-time object detection. Sensors 19(4):866

Ouyang Z, Niu J, Liu Y, Guizani M (2019) Deep CNN-based real-time traffic light detector for self-driving vehicles. IEEE Trans Mob Comput 19(2):300–313

Paluru N, Dayal A, Jenssen HB, Sakinis T, Cenkeramaddi LR, Prakash J, Yalavarthy PK (2021) Anam-Net: anamorphic depth embedding-based lightweight CNN for segmentation of anomalies in COVID-19 chest CT images. IEEE Trans Neural Netw Learn Syst 32(3):932–946

Panero Martinez R, Schiopu I, Cornelis B, Munteanu A (2021) Real-time instance segmentation of traffic videos for embedded devices. Sensors 21(1):275

Pang J, Li C, Shi J, Xu Z, and Feng H (2019) R2-CNN: fast tiny object detection in large-scale remote sensing images. arXiv 2019. arXiv preprint arXiv:1902.06042

Paoletti ME, Haut JM, Pereira NS, Plaza J, Plaza A (2021) Ghostnet for hyperspectral image classification. IEEE Trans Geosci Remote Sens 59(12):10378–10393

Picron C, Tuytelaars T (2021) Trident pyramid networks: the importance of processing at the feature pyramid level for better object detection. arXiv preprint arXiv:2110.04004

Ping P, Huang C, Ding W, Liu Y, Chiyomi M, Kazuya T (2023) Distracted driving detection based on the fusion of deep learning and causal reasoning. Inf Fus 89:121–142

Qian S, Ning C, Hu Y (2021) MobileNetV3 for image classification. In 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE) (pp. 490–497). IEEE

Qin Z, Li Z, Zhang Z, Bao Y, Yu G, Peng Y, Sun J (2019) ThunderNet: towards real-time generic object detection on mobile devices. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6718–6727)

Qin S, Liu S (2020) Efficient and unified license plate recognition via lightweight deep neural network. IET Image Proc 14(16):4102–4109

Quang TN, Lee S, Song BC (2021) Object detection using improved bi-directional feature pyramid network. Electronics 10(6):746

Ran X, Chen H, Liu Z, Chen J (2017) Delivering deep learning to mobile devices via offloading. In Proceedings of the Workshop on Virtual Reality and Augmented Reality Network (pp. 42–47)

Rani E (2021) LittleYOLO-SPP: a delicate real-time vehicle detection algorithm. Optik 225:165818

Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263–7271)

Redmon J, Farhadi A (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767

Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788)

Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst. https://doi.org/10.1109/TPAMI.2016.2577031

Ren J, Guo Y, Zhang D, Liu Q, Zhang Y (2018) Distributed and efficient object detection in edge computing: challenges and solutions. IEEE Netw 32(6):137–143

Rodriguez-Conde I, Campos C, Fdez-Riverola F (2021) On-device object detection for more efficient and privacy-compliant visual perception in context-aware systems. Appl Sci 11(19):9173

Rui Z, Zhaokui W, Yulin Z (2019) A person-following nanosatellite for in-cabin astronaut assistance: system design and deep-learning-based astronaut visual tracking implementation. Acta Astronaut 162:121–134

Saidi A, Othman SB, Dhouibi M, Saoud SB (2021) FPGA-based implementation of classification techniques: a survey. Integration 81:280–299

Samore A, Rusci M, Lazzaro D, Melpignano P, Benini L, Morigi S (2020) BrightNet: a deep CNN for OLED-based point of care immunofluorescent diagnostic systems. IEEE Trans Instrum Meas 69(9):6766–6775

Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520)

Sharma VK, Mir RN (2020) A comprehensive and systematic look up into deep learning based object detection techniques: a review. Comput Sci Rev 38:100301

Article   MathSciNet   Google Scholar  

Shi C, Wang T, Wang L (2020) Branch feature fusion convolution network for remote sensing scene classification. IEEE J Sel Top Appl Earth Observ Remote Sens 13:5194–5210

Shoeibi A, Khodatars M, Jafari M, Ghassemi N, Moridian P, Alizadehsani R, Ling SH, Khosravi A, Alinejad-Rokny H, Lam HK, Fuller-Tyszkiewicz M (2023) Diagnosis of brain diseases in fusion of neuroimaging modalities using deep learning: a review. Inf Fus 93:85–117

Silva SH, Rad P, Beebe N, Choo KKR, Umapathy M (2019) Cooperative unmanned aerial vehicles with privacy preserving deep vision for real-time object identification and tracking. J Parallel Distrib Comput 131:147–160

Song S, Jing J, Huang Y, Shi M (2021) EfficientDet for fabric defect detection based on edge computing. J Eng Fibers Fabr 16:15589250211008346

Steimle F, Wieland M, Mitschang B, Wagner S, Leymann F (2017) Extended provisioning, security and analysis techniques for the ECHO health data management system. Computing 99:183–201

Subedi P, Hao J, Kim IK, Ramaswamy L (2021) AI multi-tenancy on edge: concurrent deep learning model executions and dynamic model placements on edge devices. In 2021 IEEE 14th International Conference on Cloud Computing (CLOUD) (pp. 31–42). IEEE

Sun Y, Pan B, Fu Y (2021) Lightweight deep neural network for real-time instrument semantic segmentation in robot assisted minimally invasive surgery. IEEE Robot Autom Lett 6(2):3870–3877

Tan M, Chen B, Pang R, Vasudevan V, Sandler M, Howard A, Le QV (2019) Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2820–2828)

Tan M, Pang R, Le QV (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10781–10790)

Tang Q, Li J, Shi Z, Hu Y (2020) Lightdet: a lightweight and accurate object detection network. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2243–2247). IEEE

Tang Z, Liu X, Shen G, and Yang B (2020) Penet: object detection using points estimation in aerial images. arXiv preprint arXiv:2001.08247 .

Tsai WC, Lai JS, Chen KC, Shivanna V, Guo JI (2021) A lightweight motional object behavior prediction system harnessing deep learning technology for embedded adas applications. Electronics 10(6):692

Tzelepi M, Tefas A (2020) Improving the performance of lightweight CNNs for binary classification using quadratic mutual information regularization. Pattern Recogn 106:107407

Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vision 104:154–171

Ullah A, Muhammad K, Ding W, Palade V, Haq IU, Baik SW (2021) Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Appl Soft Comput 103:107102

Véstias MP, Duarte RP, de Sousa JT, Neto HC (2020) Moving deep learning to the edge. Algorithms 13(5):125

Wang RJ, Li X, Ling CX (2018) Pelee: a real-time object detection system on mobile devices. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.1804.06882

Wang X, Han Y, Leung VC, Niyato D, Yan X, Chen X (2020a) Convergence of edge computing and deep learning: a comprehensive survey. IEEE Commun Surv Tutor 22(2):869–904

Wang F, Xie F, Shen S, Huang L, Sun R, Le Yang J (2020c) A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax. IEEE Access 8:175370–175384

Wang T, Wang P, Cai S, Zheng X, Ma Y, Jia W, Wang G (2021a) Mobile edge-enabled trust evaluation for the Internet of Things. Inf Fus 75:90–100

Wang J, Huang R, Guo S, Li L, Zhu M, Yang S, Jiao L (2021c) NAS-guided lightweight multiscale attention fusion network for hyperspectral image classification. IEEE Trans Geosci Remote Sens 59(10):8754–8767

Wang D, Ren J, Wang Z, Zhang Y, Shen XS (2022a) PrivStream: a privacy-preserving inference framework on IoT streaming data at the edge. Inf Fus 80:282–294

Wang G, Ding H, Li B, Nie R, Zhao Y (2022b) Trident-YOLO: improving the precision and speed of mobile device object detection. IET Image Proc 16(1):145–157

Wang Y, Wang J, Zhang W, Zhan Y, Guo S, Zheng Q, Wang X (2022c) A survey on deploying mobile deep learning applications: a systemic and technical perspective. Digit Commun Netw 8(1):1–17

Wang X, Zhao Q, Jiang P, Zheng Y, Yuan L, Yuan P (2022d) LDS-YOLO: a lightweight small object detection method for dead trees from shelter forest. Comput Electron Agric 198:107035

Wang C, Wang Z, Li K, Gao R, Yan L (2023b) Lightweight object detection model fused with feature pyramid. Multimedia Tools Appl 82(1):601–618

Wang K, Liew JH, Zou Y, Zhou D, Feng J (2019) Panet: few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9197–9206)

Wang CY, Liao HYM, Wu YH, Chen PY, Hsieh JW, Yeh IH (2020) CSPNet: a new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 390–391).

Wang CY, Bochkovskiy A, Liao HYM (2021) Scaled-yolov4: scaling cross stage partial network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13029–13038)

Wang CY, Bochkovskiy A, Liao HYM (2023) YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7464–7475)

Wu Q, Wang H, Liu Y, Zhang L, Gao X (2019) SAT: single-shot adversarial tracker. IEEE Trans Industr Electron 67(11):9882–9892

Wu X, Sahoo D, Hoi SC (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64

Wu Y, Feng S, Huang X, Wu Z (2021) L4Net: an anchor-free generic object detector with attention mechanism for autonomous driving. IET Comput Vision 15(1):36–46

Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X (2020) A review of object detection based on deep learning. Multimedia Tools Appl 79:23729–23791

Xu D, Wu Y (2021) FE-YOLO: a feature enhancement network for remote sensing target detection. Remote Sens 13(7):1311

Xu Z, Liu W, Huang J, Yang C, Lu J, Tan H (2020) Artificial intelligence for securing IoT services in edge computing: a survey. Secur Commun Netw 2020(1):8872586

Xu C, Zhu G, Shu J (2021) A lightweight and robust lie group-convolutional neural networks joint representation for remote sensing scene classification. IEEE Trans Geosci Remote Sens 60:1–15

Xu M, Liu J, Liu Y, Lin F X, Liu Y, Liu X (2019) A first look at deep learning apps on smartphones. In The World Wide Web Conference (pp. 2125–2136)

Xu S, Wang X, Lv W, Chang Q, Cui C, Deng K, Wang G, Dang Q, Wei S, Du Y, Lai B (2022) PP-YOLOE: an evolved version of YOLO. arXiv preprint arXiv:2203.16250

Yang Z, Rothkrantz, L (2011) Surveillance system using abandoned object detection. In Proceedings of the 12th international conference on computer systems and technologies (pp. 380–386)

Yang Z, Liu S, Hu H, Wang L, Lin S (2019) Reppoints: point set representation for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9657–9666)

Yi Z, Yongliang S, Jun Z (2019) An improved tiny-yolov3 pedestrian detection algorithm. Optik 183:17–23

Yin R, Zhao W, Fan X, Yin Y (2020) AF-SSD: an accurate and fast single shot detector for high spatial remote sensing imagery. Sensors 20(22):6530

Yin T, Chen W, Liu B, Li C, Du L (2023) Light “You Only Look Once”: an improved lightweight vehicle-detection model for intelligent vehicles under dark conditions. Mathematics 12(1):124

Yu J, Jiang Y, Wang Z, Cao Z, Huang T (2016) Unitbox: an advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520)

Yu G, Chang Q, Lv W, Xu C, Cui C, Ji W, Dang Q, Deng K, Wang G, Du Y, Lai B, Ma Y (2021) PP-PicoDet: a better real-time object detector on mobile devices. arXiv preprint arXiv:2111.00902

Yuan F, Zhang L, Wan B, Xia X, Shi J (2019) Convolutional neural networks based on multi-scale additive merging layers for visual smoke recognition. Mach vis Appl 30:345–358

Zaidi S, Ansari SA, Aslam MS, Kanwal N, Asghar M, Lee B (2022) A survey of modern deep learning based object detection models. Digit Sig Process 126:103514

Zhang S, Wang X, Lei Z, Li SZ (2019a) Faceboxes: a CPU real-time and accurate unconstrained face detector. Neurocomputing 364:297–309

Zhang Y, Liu M, Chen Y, Zhang H, Guo Y (2019b) Real-time vision-based system of fault detection for freight trains. IEEE Trans Instrum Meas 69(7):5274–5284

Zhang X, Lin X, Zhang Z, Dong L, Sun X, Sun D, Yuan K (2020b) Artificial intelligence medical ultrasound equipment: application of breast lesions detection. Ultrason Imaging 42(4–5):191–202

Zhang S, Li Y, Liu X, Guo S, Wang W, Wang J, Ding B, Wu D (2020c) Towards real-time cooperative deep inference over the cloud and edge end devices. Proc ACM Interact Mobile Wearable Ubiquitous Technol 4(2):1–24

Zhang Y, Zhang H, Huang Q, Han Y, Zhao M (2024) DsP-YOLO: an anchor-free network with DsPAN for small object detection of multiscale defects. Expert Syst Appl 241:122669

Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4203–4212)

Zhang X, Zhou X, Lin M, Sun J (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6848–6856)

Zhang S, Chi C, Yao Y, Lei Z, Li SZ (2020) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9759–9768)

Zhao ZQ, Zheng P, Xu ST, Wu X (2019) Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst 30(11):3212–3232

Zhao H, Zhou Y, Zhang L, Peng Y, Hu X, Peng H, Cai X (2020a) Mixed YOLOv3-LITE: a lightweight real-time object detection method. Sensors 20(7):1861

Zhao Z, Zhang Z, Xu X, Xu Y, Yan H, Zhang L (2020b) A lightweight object detection network for real-time detection of driver handheld call on embedded devices. Comput Intell Neurosci 2020(1):6616584

Zhao Y, Yin Y, Gui G (2020c) Lightweight deep learning based intelligent edge surveillance techniques. IEEE Trans Cognit Commun Netw 6(4):1146–1154

Zheng G, Chai WK, Duanmu JL, Katos V (2023) Hybrid deep learning models for traffic prediction in large-scale road networks. Inf Fus 92:93–114

Zhou Y (2024) A YOLO-NL object detector for real-time detection. Expert Syst Appl 238:122256

Zhou T, Fan DP, Cheng MM, Shen J, Shao L (2021a) RGB-D salient object detection: a survey. Comput Visual Media 7:37–69

Zhou X, Li X, Hu K, Zhang Y, Chen Z, Gao X (2021b) ERV-Net: an efficient 3D residual neural network for brain tumor segmentation. Expert Syst Appl 170:114566

Zhou L, Rao X, Li Y, Zuo X, Qiao B, Lin Y (2022) A lightweight object detection method in aerial images based on dense feature fusion path aggregation network. ISPRS Int J Geo Inf 11(3):189

Zhou X, Wang D, Krähenbühl P (2019) Objects as points. arXiv preprint arXiv:1904.07850

Zhou X, Zhuo J, Krahenbuhl P (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 850–859)

Zhou L, Wei S, Cui Z, Ding W (2019) YOLO-RD: a lightweight object detection network for range doppler radar images. In IOP Conference Series: Materials Science and Engineering (Vol. 563, No. 4, p. 042027). IOP Publishing

Zhu Z, He X, Qi G, Li Y, Cong B, Liu Y (2023) Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal MRI. Inf Fus 91:376–387

Zitnick CL, Dollár P (2014) Edge boxes: locating object proposals from edges. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13 (pp. 391–405). Springer International Publishing

Zou Z, Chen K, Shi Z, Guo Y, Ye J (2023) Object detection in 20 years: a survey. Proc IEEE 111(3):257–276

Download references

Author information

Authors and Affiliations

CSED, Thapar Institute of Engineering & Technology, Patiala, India

Payal Mittal


Contributions

I, Payal Mittal, am the sole author of this manuscript.

Corresponding author

Correspondence to Payal Mittal.

Ethics declarations

Competing interests

The author declares no competing interests.

Consent for publication

During the preparation of this work, the author did not use generative AI or AI-assisted technologies in the writing of this manuscript. The author reviewed and edited the content manually as needed and takes full responsibility for the content of the publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Mittal, P. A comprehensive survey of deep learning-based lightweight object detection models for edge devices. Artif Intell Rev 57, 242 (2024). https://doi.org/10.1007/s10462-024-10877-1


Accepted: 25 July 2024

Published: 10 August 2024

DOI: https://doi.org/10.1007/s10462-024-10877-1

Keywords

  • Deep learning
  • Lightweight networks
  • Object detection
  • Computer vision
  • Edge devices
  • Computing power



TensorFlow Object Detection API - Access Violation (Windows 10) - Nvidia RTX 2080 #10436


muyigl95 commented Dec 25, 2021

Hello,

I want to train my model using the TensorFlow Object Detection API. I used the following site as a guide:

When I execute model_main_tf2.py, I receive the following error message:

application error:

It only happens when I train with the GPU, so the problem seems related to my Nvidia RTX 2080. I tested my installation with model_builder_tf2_test.py from /models/research/, and it completed successfully. When I installed CUDA, I got this message after the installation:

not installed:

Could the last two messages cause the problem?

pipeline.config:

My versions:

nvidia-smi:

nvcc --version:

If more information is needed, please let me know!
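For context, training with the TF2 Object Detection API is typically launched with a command along the following lines (a minimal sketch; the directory and config paths are placeholders, not the reporter's actual setup):

```
python model_main_tf2.py \
    --model_dir=training/my_model \
    --pipeline_config_path=training/my_model/pipeline.config \
    --alsologtostderr
```

The crash described above occurs inside this process once GPU training begins, which is why the GPU initially looked like the culprit.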


muyigl95 commented Jan 2, 2022

Hello,

I fixed the problem. It wasn't the GPU.

It was the virtual memory (pagefile). I changed it to "System managed size".
Here is the link for setting the virtual memory:

Greetings,
muyigl95


IMAGES

  1. Pipeline for developing the TensorFlow Lite Object Detection Model
  2. [PDF] Mobile Object Detection using TensorFlow Lite and Transfer Learning
  3. Edureka TensorFlow Object Detection: Realtime Object Detection with ...
  4. AI Robot
  5. TensorFlow Object Detection Flowchart
  6. TensorFlow Object Detection Tutorial

COMMENTS

  1. (PDF) Real Object Detection Using TensorFlow

    For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% ...

  2. Object Detection Using TensorFlow

    As deep learning models and AI technologies continue to progress, there is great potential to further enhance the precision, capability, and performance of real-time object detection methods. This research paper presents an object detection approach using the TensorFlow framework and demonstrates its effectiveness and potential for practical ...

  3. Object Detection using TensorFlow

    Abstract: Objects in the home that are often used tend to follow specific patterns in terms of time and location. Analyzing these trends can help us keep track of our belongings and increase efficiency by reducing the time wasted forgetting or looking for them. TensorFlow, a relatively new framework ...

  4. Object Detection Using TensorFlow

    This research paper presents an object detection approach using the TensorFlow framework and demonstrates its effectiveness and potential for practical applications.

  5. (PDF) Tensor flow object detection

    Abstract: Creating accurate machine learning models capable of localizing and identifying multiple objects in a single image remains a core challenge in computer ...

  6. PDF Real Object Detection Using TensorFlow

    By Milind Rane, Aseem Patil and Bhushan Barse. ... then TensorFlow comes in handy. This paper presents a new method for obstacle detection with a single webcam camera. It also presents a ...

  7. Object Detection and Count of Objects in Image using Tensor Flow Object

    Object detection is widely utilized in applications such as vehicle detection, face detection, autonomous vehicles, and pedestrian detection on streets. TensorFlow's Object Detection API is a powerful tool that can quickly enable anyone to build and deploy powerful image recognition software. Object detection not only involves classifying and recognizing objects in an image, but ...

  8. Object Detection for Autonomous Vehicle Using TensorFlow

    Object detection is a rapidly growing research area in the field of computer vision. The ability to identify and recognize objects in one or more image frames is extremely important in contexts such as driving, where the driver may fail to identify objects properly due to lapses of attention, light reflections, unfamiliar objects, etc., which may lead to fatal accidents.

  9. Traffic Light Detection Using Tensorflow Object Detection Framework

    This paper presents a deep learning approach for robust detection of traffic lights by comparing two object detection models and evaluating the flexibility of the TensorFlow Object Detection Framework for solving real-time problems. The models are Single Shot Multibox Detector (SSD) MobileNet V2 and Faster R-CNN. ...

  10. PDF Real-Time Object Detection using TensorFlow

    The model is trained to detect objects in real time. This is best achieved through a universal, open-source library: TensorFlow (TF). Within the TF environment, multiple algorithms can be used for a wide range of datasets. In this paper, we make use of the CIFAR-10 dataset of objects seen on a daily basis.
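As a rough illustration of the kind of model this paper describes, a minimal TensorFlow CNN classifier for CIFAR-10 might look like the sketch below (the architecture and hyperparameters are illustrative assumptions, not the paper's exact network):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load CIFAR-10: 50k train / 10k test images, 32x32 RGB, 10 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

# Two conv + pooling blocks followed by a small dense classifier.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10),  # logits for the 10 CIFAR-10 classes
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```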

  11. [PDF] Mobile Object Detection using TensorFlow Lite and Transfer

    Mobile Object Detection using TensorFlow Lite and Transfer Learning. Oscar Alsing. Published 2018. Computer Science, Engineering. TLDR. This research presents a novel approach called "supervised learning" that automates the very labor-intensive and therefore time-heavy and expensive process of manually cataloging objects in an image.

  12. models/research/object_detection/README.md at master · tensorflow

    The TensorFlow Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models. At Google we've certainly found this codebase to be useful for our computer vision needs, and we hope that you will as well. If you use the TensorFlow Object Detection API for a ...

  13. Sign Language Recognition System using TensorFlow Object Detection API

    The existing Indian Sign Language Recognition systems are designed using machine learning algorithms with single- and double-handed gestures, but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then, using transfer learning, train a TensorFlow model to create a real-time Sign ...
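A webcam dataset-collection loop of the kind this paper describes can be sketched with OpenCV as follows (the gesture labels, image counts, and delays are illustrative assumptions, not the authors' actual protocol):

```python
import os
import time

import cv2  # OpenCV for webcam access

# Hypothetical settings: gesture labels and images to capture per label.
LABELS = ['hello', 'thanks', 'yes', 'no']
IMAGES_PER_LABEL = 20
OUTPUT_DIR = 'dataset'

cap = cv2.VideoCapture(0)  # open the default webcam
for label in LABELS:
    os.makedirs(os.path.join(OUTPUT_DIR, label), exist_ok=True)
    print(f'Collecting images for "{label}" in 3 seconds...')
    time.sleep(3)  # give the signer time to form the gesture
    for i in range(IMAGES_PER_LABEL):
        ret, frame = cap.read()
        if not ret:
            break
        cv2.imwrite(os.path.join(OUTPUT_DIR, label, f'{label}_{i}.jpg'), frame)
        time.sleep(0.5)  # small delay so frames are not near-duplicates
cap.release()
```

The captured images would then be annotated and fed to a pre-trained detector for transfer learning, as the paper outlines.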

  14. (PDF) A review: Comparison of performance metrics of ...

    On Jun 30, 2020, S. A. Sanchez and others published "A review: Comparison of performance metrics of pretrained models for object detection using the TensorFlow framework".

  15. Object detection with Model Garden

    The implementations demonstrate best practices for modeling, letting users take full advantage of TensorFlow for their research and product development. This tutorial demonstrates how to use models from the TensorFlow Model Garden (TFM) package and fine-tune a pre-trained RetinaNet with ResNet-50 as the backbone for object detection.

  16. PDF Object Detection and Recognition Using Tensorflow for Blind People

    Assistive object-finding system for visually impaired people: in 2020, "Assistive Object Recognition/Finding System for the Visually Impaired" addressed the problem of visual impairment faced worldwide, proposing a solution with two cameras placed on the blind person's glasses, a GPS-free service, and ultrasonic sensors.

  17. Object Recognition using TensorFlow

    Computers can apply vision technologies using cameras and artificial intelligence software to achieve image recognition and identify objects, places, and people. The objective of this project is to capture the image of an automobile as it drives by, identify its model and color, and determine its location, travel direction, and speed. This system can be used to assist law enforcement with ...

  18. Object Detection

    This Colab demonstrates use of a TF-Hub module trained to perform object detection. Setup: imports and function definitions.

```python
# For running inference on the TF-Hub module.
import tensorflow as tf
import tensorflow_hub as hub

# For downloading the image.
import matplotlib.pyplot as plt
import tempfile
from six.moves.urllib.request import urlopen
from six import BytesIO

# For drawing ...
```
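Loading and running such a TF-Hub detector typically follows the pattern below (a minimal sketch; the module handle is one publicly listed Open Images detector, and example.jpg is a placeholder):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load a detector published on TF-Hub: an SSD + MobileNetV2 module
# trained on Open Images V4.
module_handle = "https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1"
detector = hub.load(module_handle).signatures['default']

# The module expects a float32 image in [0, 1], batched to [1, H, W, 3].
img = tf.image.decode_jpeg(tf.io.read_file('example.jpg'), channels=3)
converted = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
result = detector(converted)

# The output dict includes detection_boxes, detection_scores, and
# detection_class_entities tensors.
print({k: v.shape for k, v in result.items()})
```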

  19. Smart Hat for the Blind with Real-Time Object Detection using Raspberry Pi

    This work improves the processing speed of object detection by introducing the latest Raspberry Pi 4 module, which is more powerful than the previous versions. As a result, the Single-Shot Multibox Detector MobileNet v2 convolutional neural network on the Raspberry Pi 4, using TensorFlow Lite 2, is employed for object detection. A model called SSD MobileNet v2 320x320 ...
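For reference, running a converted SSD MobileNet model with the TFLite interpreter on a device like the Raspberry Pi follows this general pattern (a minimal sketch assuming a quantized uint8 model; detect.tflite and the zeroed input frame are placeholders):

```python
import numpy as np
import tflite_runtime.interpreter as tflite  # lightweight runtime for the Pi

# Hypothetical model file: an SSD MobileNet detector converted to TFLite.
interpreter = tflite.Interpreter(model_path='detect.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one uint8 frame of the size the model expects (e.g. 320x320x3).
h, w = input_details[0]['shape'][1:3]
frame = np.zeros((1, h, w, 3), dtype=np.uint8)  # stand-in for a camera frame
interpreter.set_tensor(input_details[0]['index'], frame)
interpreter.invoke()

# Detection models typically emit boxes, classes, scores, and a count;
# the exact output ordering depends on the converted model.
boxes = interpreter.get_tensor(output_details[0]['index'])
```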

  20. PDF Real Time Object Detection and Voice Assistance for Blind Using Tensorflow

    Single Shot Multibox Detection (SSD) is a faster detection approach for real-time object recognition, based on the convolutional neural network model proposed in this paper. The feature resampling stage was eliminated in this work, and all computed results were merged into a single component. However, a lightweight network model is ...

  21. Object Detection and Pattern Tracking Using TensorFlow

    The TensorFlow Object Detection API is used to detect multiple objects in real-time video streams. We then introduce an algorithm to detect patterns and alert the user if an anomaly is found. We consider the research presented by Laube et al., "Finding REMO: detecting relative motion patterns in geospatial lifelines", 201–214 (2004) [1].

  22. Real-Time Object Detection with Tensorflow Model Using ...

    Sep 2023, Shermila Crespo. ... In "Real-Time Object Detection with TensorFlow Model Using Edge Computing Architecture," N et al. [4] propose a real-time object detection system using a TensorFlow ...

  23. MPE-YOLO: enhanced small target detection in aerial imaging

    In the field of medical image processing, typical object detection algorithms include the following: Pacal et al. [27] demonstrated that by improving the YOLO algorithm and using the latest data augmentation and ...

  24. JMSE

    ... further adapted the joint-training scheme of the Faster R-CNN framework from Caffe to TensorFlow, providing a baseline implementation for object detection. The YOLO algorithm was presented as a new method for object detection in Ref. [23], which frames detection as a regression problem to identify bounding boxes and confidence levels.

  25. A comprehensive survey of deep learning-based lightweight object

    This study concentrates on deep learning-based lightweight object detection models on edge devices. Designing such lightweight object recognition models is more difficult than ever due to the growing demand for accurate, quick, and low-latency models for various edge devices. The most recent deep learning-based lightweight object detection methods are comprehensively described in this work ...

  26. Object Detection Using Deep Learning, CNNs and Vision Transformers: A

    Detecting objects remains one of computer vision and image understanding applications' most fundamental and challenging aspects. Significant advances in object detection have been achieved through improved object representation and the use of deep neural network models. This paper examines more closely how object detection has evolved in the era of deep learning over the past years. We ...

  27. TensorFlow Object Detection API

```
... desired, use `tf.data.Options.deterministic`.
WARNING:tensorflow:From C:\Users\Ameise\anaconda3\envs\tensorflow\lib\site-packages\object_detection\builders\dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
```
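The deprecation warning above names its own replacement; switching to tf.data.Options.deterministic looks roughly like this (a minimal sketch; train.record is a placeholder input file):

```python
import tensorflow as tf

# Any input pipeline; the concrete dataset does not matter for the option.
dataset = tf.data.TFRecordDataset(['train.record'])

# The warning suggests tf.data.Options.deterministic, which replaces the
# older experimental_deterministic attribute in recent TensorFlow releases.
options = tf.data.Options()
options.deterministic = False  # allow out-of-order elements for throughput
dataset = dataset.with_options(options)
```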