Research Article
Open access
Published on 8 November 2024

SOVAR: System of Visual Assistance and Recognition

Ken Huang 1,*
  • 1 The Athenian School, Danville, USA

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2755-2721/81/20241085

Abstract

Around 295 million people worldwide live with moderate to severe vision impairment; they often struggle with daily activities and depend heavily on others for assistance. Leveraging augmented reality (AR) and artificial intelligence (AI), we developed SOVAR, a mobile application that enables greater independence for visually impaired individuals in their daily lives. SOVAR comprises two modules: navigation and scene understanding. Navigation has two phases, mapping and guidance. During mapping, SOVAR builds and optimizes a map of the environment, with key locations labeled by users via voice input. During guidance, SOVAR plans a path and guides users to requested key locations with visual and audio assistance and real-time obstacle avoidance. The scene understanding module uses a Large Vision-Language Model (LVLM) to help users through image captioning and visual question answering. For navigation, our user study shows that participants successfully navigated to key locations from three separate starting locations in 86.67% of trials without intervention, and the success rate improved as users became more familiar with the application. For scene understanding, leveraging the LVLM allowed participants to answer vision-related questions with 100% accuracy. In summary, SOVAR applies AR and AI to assist the visually impaired in navigation and scene understanding; the promising results from our user study demonstrate the effectiveness of AR and AI for visual assistance and indicate their potential impact on assistive technologies more broadly.
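
The abstract does not spell out implementation details, but the mapping and guidance phases map naturally onto an AR world-map workflow on iOS. The sketch below is a minimal illustration under that assumption: spoken key-location names become named anchors, the world map is persisted after mapping, and guidance reloads it for relocalization. The `MappingSession` type and its method names are illustrative, not the authors' API.

```swift
import ARKit

// Minimal sketch only. It assumes an ARKit world-map workflow; SOVAR's actual
// implementation is not described at this level of detail, and MappingSession
// and its method names are illustrative.
final class MappingSession {
    private let session: ARSession

    init(session: ARSession) {
        self.session = session
    }

    // Mapping phase: label the user's current position as a key location
    // (e.g. a spoken name such as "kitchen") by adding a named anchor at the
    // current camera pose.
    func labelKeyLocation(named name: String) {
        guard let frame = session.currentFrame else { return }
        session.add(anchor: ARAnchor(name: name, transform: frame.camera.transform))
    }

    // Persist the world map, including the labeled anchors, so the guidance
    // phase can relocalize against it later.
    func saveWorldMap(to url: URL, completion: @escaping (Error?) -> Void) {
        session.getCurrentWorldMap { worldMap, error in
            guard let worldMap = worldMap else {
                completion(error)
                return
            }
            do {
                let data = try NSKeyedArchiver.archivedData(
                    withRootObject: worldMap, requiringSecureCoding: true)
                try data.write(to: url, options: [.atomic])
                completion(nil)
            } catch {
                completion(error)
            }
        }
    }

    // Guidance phase: seed a new tracking configuration with the saved map so
    // the stored key-location anchors reappear once the session relocalizes.
    func startGuidance(with worldMap: ARWorldMap) {
        let configuration = ARWorldTrackingConfiguration()
        configuration.initialWorldMap = worldMap
        session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
    }
}
```

Once relocalized, the transform of the anchor matching a spoken destination can serve as the goal for the path planning and audio/visual guidance described in the abstract.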

Keywords

Augmented reality, AI, navigation, visually impaired.


Cite this article

Huang, K. (2024). SOVAR: System of Visual Assistance and Recognition. Applied and Computational Engineering, 81, 181-189.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2nd International Conference on Machine Learning and Automation

Conference website: https://2024.confmla.org/
ISBN: 978-1-83558-563-4 (Print) / 978-1-83558-564-1 (Online)
Conference date: 21 November 2024
Editor: Mustafa ISTANBULLU, Xinqing Xiao
Series: Applied and Computational Engineering
Volume number: Vol. 81
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).