CNN-LSTM implementation methodology on SoC FPGA for human action recognition based on video
Date
2024
Rights
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Published in
27th Euromicro Conference on Digital System Design (DSD), Paris, 2024, pp. 202-209
Publisher
Institute of Electrical and Electronics Engineers Inc.
Link to publication
Keywords
AMD-Xilinx
CNN-LSTM
Deep learning
HAR
SoC FPGA
UCF101
Vitis-AI DPU
ZCU102
Zynq UltraScale+ MPSoC
Abstract
The growing use of AI-driven video applications such as surveillance or healthcare monitoring underscores the need for embedded solutions capable of accurately categorizing human actions in real-time video. A methodology is proposed for implementing a customized CNN-LSTM architecture on AMD-Xilinx SoC FPGA devices for human action categorization from video data. In this approach, CNN operations are accelerated by the Vitis-AI DPU within the FPGA, offering the flexibility to support a range of CNN architectures without requiring individual hardware description language development. This adaptability is crucial given the varying performance of CNN models across datasets. LSTM operations are executed on the SoC processors, overcoming the limited support that DPU IP cores provide for such networks, while maintaining the flexibility to assess different configurations. Additionally, a pipeline strategy is proposed to enable parallel execution of the CNN and LSTM components, optimizing resource utilization and minimizing idle time. To demonstrate the validity of the proposed implementation methodology, experiments were conducted on the ZCU102 development board, equipped with a Zynq UltraScale+ MPSoC, using the VGG16 CNN model along with the exploration of different LSTM configurations. The results demonstrate remarkable computational performance, achieving frame rates of up to 44.34 FPS for videos recorded at a resolution of 320×240 pixels, surpassing real-time requirements. Additionally, the proposed implementation maintains high accuracy, exemplified by a single bidirectional LSTM layer achieving a competitive accuracy of 73.33% on the UCF101 dataset.
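As a rough illustration of the split described in the abstract (CNN inference on the Vitis-AI DPU, LSTM on the SoC's ARM processors, with the two stages overlapped through a pipeline), the sketch below combines the public Vitis AI Python runtime (xir/vart) for the DPU stage with a stand-in PyTorch bidirectional LSTM for the CPU stage. The model file name, 16-frame sequence length, layer sizes, and dummy input frames are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: model file name, sequence length, LSTM sizes,
# and the dummy frames are assumptions, not details from the paper.
import queue
import threading

import numpy as np
import torch
import vart
import xir


def make_dpu_runner(xmodel_path):
    # Deserialize the compiled model and hand its DPU subgraph to a runner.
    graph = xir.Graph.deserialize(xmodel_path)
    dpu_subgraphs = [
        s for s in graph.get_root_subgraph().toposort_child_subgraph()
        if s.has_attr("device") and s.get_attr("device").upper() == "DPU"
    ]
    return vart.Runner.create_runner(dpu_subgraphs[0], "run")


def cnn_stage(runner, frames, feature_q):
    # Producer: per-frame feature extraction on the DPU.
    in_t = runner.get_input_tensors()[0]
    out_t = runner.get_output_tensors()[0]
    for frame in frames:
        in_buf = np.asarray(frame, dtype=np.float32).reshape(tuple(in_t.dims))
        out_buf = np.empty(tuple(out_t.dims), dtype=np.float32)
        job = runner.execute_async([in_buf], [out_buf])
        runner.wait(job)
        feature_q.put(out_buf.reshape(-1))   # flattened CNN feature vector
    feature_q.put(None)                      # end-of-stream sentinel


def lstm_stage(lstm, head, feature_q, seq_len=16):
    # Consumer: sequence classification on the ARM cores; the bounded queue
    # lets this stage overlap with the DPU work (the pipelining idea above).
    feats = []
    while (f := feature_q.get()) is not None:
        feats.append(torch.from_numpy(f))
        if len(feats) == seq_len:
            seq = torch.stack(feats).unsqueeze(0)        # (1, T, feat_dim)
            out, _ = lstm(seq)
            print("action id:", head(out[:, -1]).argmax(dim=1).item())
            feats.clear()


runner = make_dpu_runner("vgg16_ucf101.xmodel")          # hypothetical file
feat_dim = int(np.prod(runner.get_output_tensors()[0].dims[1:]))
lstm = torch.nn.LSTM(feat_dim, 256, batch_first=True, bidirectional=True)
head = torch.nn.Linear(2 * 256, 101)                     # UCF101: 101 classes
in_dims = tuple(runner.get_input_tensors()[0].dims)      # e.g. (1, 224, 224, 3)
frames = (np.random.rand(*in_dims).astype(np.float32) for _ in range(16))
q = queue.Queue(maxsize=32)
producer = threading.Thread(target=cnn_stage, args=(runner, frames, q))
producer.start()
lstm_stage(lstm, head, q)
producer.join()
```

The bounded queue is what realizes the pipelining the abstract mentions: while the CPU consumes the features of one clip, the DPU can already be extracting features for the next, keeping both compute resources busy.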
Collections
- D50 Congresos
- D50 Proyectos de Investigación