Article Abstract

Volume 36, No. 2, 2026 (April)
A COMPREHENSIVE REVIEW OF OBJECT DETECTION IN ANIMAL AND PLANT USING VISION TRANSFORMER
Maolan Lin, Zhenchang Gao, Honghao Cai, Wenliang Liao

M. Lin¹, Z. Gao², H. Cai³*, W. Liao⁴

¹ Department of Physics, School of Science, Jimei University, Xiamen, Fujian Province, China
² School of Information Science and Technology, ShanghaiTech University, Shanghai, China; Guangdong Institution of Intelligent Science and Technology, Zhuhai, Guangdong Province, China
³ Department of Physics, School of Science, Jimei University, Xiamen, Fujian Province, China
⁴ Department of Physics, School of Science, Jimei University, Xiamen, Fujian Province, China

Corresponding Author: hhcai@jmu.edu.cn
Page Number(s): 321-330
Published Online First: December 25, 2025
Publication Date: February 28, 2026
ABSTRACT

In digital farming, computers serve as the primary sensing eyes, and object detection is the core vision task that locates and counts target objects of interest, e.g., plants, fruits and livestock, in various agricultural systems. Vision Transformers (ViTs), which adapt the Transformer architecture from natural language processing and offer an alternative to convolutional neural networks by capturing global context through self-attention, have shown great potential in object detection. However, the field of ViT-based detectors remains fragmented, with independent advances in plant and animal studies and a lack of comprehensive analysis connecting these domains. To bridge this gap, we conducted a systematic review, retaining 30 primary studies after a dual screening and quality appraisal process—20 focused on plant production and 10 on animal production. Our analysis shows that ViT-based models excel in multi-scale representation, complex scene reasoning, and efficient feature extraction. These capabilities yield high accuracy in fruit quality assessment, crop growth monitoring, weed detection, meat grading and livestock behaviour surveillance. However, challenges remain, including high computational complexity, large parameter counts, environmental variability, small-object detection, and data annotation requirements. For researchers and practitioners, this review offers a unified framework for understanding ViT-based detection. It pinpoints cross-domain challenges and concludes with a forward-looking pathway to turn these insights into practical, on-farm solutions.
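As background for readers unfamiliar with the mechanism the abstract refers to, the "global context" of a ViT comes from scaled dot-product self-attention over image-patch embeddings: every output patch is a weighted mixture of all input patches. A minimal NumPy sketch is given below; the dimensions and weight matrices are hypothetical and purely illustrative, not taken from any model covered in the review.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of patch embeddings.

    x: (n_patches, d) array of patch embeddings.
    Returns an (n_patches, d) array in which every output row mixes
    information from ALL patches -- the 'global context' property.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all patches
    return weights @ v

# Toy example: 4 patches with 8-dim embeddings (sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

In a real ViT this operation is applied per attention head, inside a stack of Transformer blocks, and the patch sequence is produced by flattening fixed-size image tiles through a learned linear projection.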

Keywords: Convolutional neural network; Computer vision; Deep learning; Machine learning; ViT; YOLO
Journal Metrics

Cite Score: 1.3 (JCR Year: 2025)

Journal Impact Factor: 0.5

HEC Category: W

Indexing

Web of Science (SCIE)

SCOPUS (Q3)

ISSN

Print ISSN: 1018-7081

Electronic ISSN: 2309-8694
