Environment Model Configuration from Low-quality Videos
López De Luise Daniela
CAETI, Universidad Abierta Interamericana, Facultad de Tecnología Informática, Av. Montes de Oca 745, Ciudad de Buenos Aires, Argentina. CI2S Labs, Pringles 50, Ciudad de Buenos Aires, Argentina. Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
daniela_ldl@ieee.org
Park Jin Sung
CI2S Labs, Pringles 50, Ciudad de Buenos Aires, Argentina.
zeroalpha2000@gmail.com
Hoferek Silvia
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina. Universidad Siglo 21, Decanato de Ciencias Aplicadas, Argentina.
srhoferek@gmail.com
Avila Lautaro Nicolás
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
lautyx027@gmail.com
Benitez Micaela Antonella
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
benitezmicaelaantonela@gmail.com
Bordon Sbardella Felix Raul
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
raulbordon250@gmail.com
Fantín Rodrigo Iván
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
rodrignr4@gmail.com
Machado Gastón Emmanuel
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
gastonmachado44@gmail.com
Mencia Aramis Oscar
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
Ríos Anahí Ailén
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
anahiriosailen@gmail.com
Rios Emiliano Luis
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
mcwiths@gmail.com
Riveros Nahuel Edgardo
Instituto de Investigaciones Científicas (IDIC), Universidad de la Cuenca del Plata (UCP), Facultad de Ingeniería, Tecnología y Arquitectura, Formosa, Corrientes, Argentina.
Nahuel42425@gmail.com
Abstract—This article describes the main findings on a prototype for assisting blind people. To improve its functioning, the main approach is to build a model dynamically using intelligent systems and Machine Learning. After several partial models, the prototype is able to detect and recognize the outline of a user's environment, specifically to determine the spatial organization of multiple objects. The paper encompasses a comprehensive set of activities aimed at evaluating and enhancing the system with efficient metrics for feature assessment of video, image segmentation, and data mining on the fly. Additionally, this work covers automatic image tagging and a set of risk rules. It also evaluates and depicts specific techniques and approaches applied to create models with high pattern-detection efficiency. The algorithm is required to be light and fast so that it can run on standard cell phones, assist blind people, and provide meaningful information to the user. A small statistical analysis is also performed as part of the current paper.
Keywords—Blind people assistance, Video processing, Object detection, Data Mining, Environment configuration.
I. Introduction
The most challenging activity for blind people is to walk outdoors without any support. HOLOTECH [1] aims to provide guidance in this context by simply using a standard cell phone. It outperforms many current proposals [2][3][4], where the traditional tools are the cane, trained animals, and assistants. World statistics indicate that the number of individuals with visual impairments grows year by year [5][6]. Exposure to nature has benefits for people's mental and physical health, and one way to facilitate it is through ubiquitous and mobile technologies. However, existing research in this area is primarily focused on people without visual impairments and is not inclusive of blind and partially sighted individuals.
Outdoor experiences in the natural environment for these people present specific needs and barriers that could be addressed by technology. According to [8] they can be classified into three main concerns: independence, knowledge of the environment, and sensory experiences.
For most people who are blind, exploring an unknown environment can be unpleasant, uncomfortable, and unsafe. The authors in [9] explore an adaptation of virtual reality as a learning and rehabilitation tool for people with disabilities, based on the hypothesis that supplying appropriate perceptual and conceptual information through compensatory sensory channels may assist people who are blind with anticipatory exploration. Their work aims to allow the user to explore a virtual environment with two main goals: evaluation of different modalities (haptic and audio) and navigation tools, and evaluation of the spatial cognitive mapping employed by people who are blind. Preliminary results indicate that comprehensive cognitive maps can be built by exploring the virtual environment. These results indicate that prototypes and research in this direction are very promising.
Since blindness can be caused by factors such as genetics, infection, disease, or injury, the help these individuals need may differ greatly from case to case. For example, someone with 20/200 vision sees from 20 feet what a person with 20/20 vision can see from 200 feet. Thus, the environmental challenges for people who are completely blind or have impaired vision usually depend on the origin of the condition and their age. In spite of that, they have a difficult time navigating outside the spaces they are accustomed to. In fact, physical movement is one of the biggest challenges for blind people, as explained by World Access for the Blind [10]. One of the most compelling difficulties with different types of blindness [7] is the loss of awareness of the environment, exposing them to a high risk of accidents. The risk factor varies with the degree of blindness and the individual's age: falls are more prevalent among older people and can have a greater impact, potentially leading to more significant complications. There are studies dedicated to preventing such issues [11], but they require time and effort. In addition to the issues under investigation, new complications associated with the global pandemic have emerged [12]. Since most visually impaired individuals are not independent, the necessary social distancing measures during the pandemic introduced complications that significantly impacted their daily lives. While the pandemic may be coming to an end, there are still limitations in the tools available to aid blind individuals with mobility. In addition to the numerous advantages provided by the traditional white cane [13], there are also several disadvantages. Addressing these limitations and disadvantages is both an issue and a necessity for the blind community.
Recognizing
the necessity for further assistance to enhance the autonomy and independence
of individuals with visual impairments, we are utilizing current technology and
machine learning to fulfill this need.
The
project HOLOTECH aims to provide information and complement the lost awareness
in an alternate way: with a simple language based on sounds. The first steps of
the prototype are to get video, detect certain objects by pattern matching,
analyze their location, velocity, and direction, and finally generate an
audible alarm with certain features mapping main information about surrounding
obstacles.
This work
focuses on the tuning of pattern-matching steps, which is relevant to increase
fidelity and precision in real-time. The approach is to train a Machine
Learning model to detect a specific type of object and to use it for
recognizing obstacles. The current study involves methods to guarantee the
accuracy of the detection and improve coding performance to be lightweight
enough to be used on a portable device. To adjust the accuracy, newly created
models are compared to legacy pre-trained models that come with OpenCV[14]. New models correspond to
objects that are to be detected but do not exist in the database of pre-trained
models. All of them are obstacles of interest identified through interaction with potential users contacted in collaboration with the Circle of the Non-Sighted (CINOVI). They determined a specific list of potential risks and objects that need to be identified. Note that the interaction with real users helps to debunk misconceptions, such as the supposed need to detect people in the environment, since blind users can easily do this on their own. It is interesting to mention that the objects constituting the most frequent problems can be handled with a cane. But those requiring more complex detection are typically at a height of one meter or more above floor level, or pits at an odd angle, which cannot be detected with that device.
Several
previous results from the project with members of the Argentine Library for the
Blind (BAC), the CAETI center of UAI, and the current team in conjunction with
CINOVI, are in [1][15][16]. The information derived from the models implemented in the
prototype can detect any type of object with different degrees of confidence.
This article presents the approach of multiple model synchronization intended to be handled
by an expert system.
The
background of the current proposal relies on multiple advances in the field of
video processing. Among others, slicing, segmentation, and Machine Learning
(ML) have been used by López De Luise et al. [17] for 3D scenes inferred from 2D images (Fig. 1), and for mood inference, where an AI artist changes a piece of human art to incorporate the inferred observer's internal status, choosing among intelligent "brushes", and then updates the picture in real-time (Fig. 2). Other authors performed smart coloring from grayscale images
[18],
endoscopic 3D reconstruction [19], image restoration [20], deep image inference [21], video compressing [22], and other extensive
applications [23] of computational intelligence on video processing.
Fig.
1. 2D figures (left) and
resulting 3D inferred figures (right)
Fig.
2. Table of Lumiere productions (an AI artist) using mood
inference
The rest
of this text covers the case study, statistical analysis, and conclusions
derived from training of the models for the prototype. It is important to
highlight that identifying the objects and analyzing them is key for the
progress of the prototype as the main goal is to provide precise information to
lower potential risks of collision during indoor and outdoor displacement for
individuals with visual impairments.
II. Methodology and Materials Selection
Due to
the context for using the proposed solution, the concept for the prototype’s
architecture involves utilizing compact, lightweight hardware, such as a
smartphone in conjunction with an Arduino Nano. This combination enables the
model to incorporate hardware that is typically not found in a smartphone, such
as an ultrasound sensor. The purpose of this architecture is to serve as an
initial prototype, facilitating our exploration of the proposed solution.
However, a crucial aspect of the prototype is to have hardware capable of
capturing information, recognizing, and executing various routines and actions.
Therefore, the minimum requirement for the architecture is the ability to
capture video and process it using a trained model for recognition. The idea behind recognizing objects is to predict and filter behavior: a chair or a desk does not move by itself, so the only way for a visually impaired person to collide with it is through their own movement; the same applies to other objects such as trees. It is therefore important to understand that even if detection must happen in real-time, the recognition process itself can take a reasonable amount of time. For that reason, the proposed architecture is centered on a smartphone. Nevertheless, for real-time cases that require fast reactions, ultrasonic sensors are incorporated to signal when a sudden object appears in the trajectory; in this case, what the object is does not matter, only the direction it came from. The proposed architecture also aims to facilitate communication with visually impaired individuals through various methods, such as sound or, in extreme cases, vibration. This article focuses on detecting and recognizing objects from a certain distance: how precise the models are with the architecture used as a base.
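As a minimal sketch of this filtering idea (the class lists and the distance threshold below are illustrative assumptions, not the prototype's actual configuration), the decision of whether a detection deserves further tracking could look like this:

```python
# Hypothetical sketch: decide whether a detected object needs tracking.
# Class lists and the distance threshold are illustrative assumptions.
STATIC_CLASSES = {"chair", "desk", "door", "tree"}   # cannot move on their own
IGNORED_CLASSES = {"person"}                          # filtered out, no alarm needed

def needs_tracking(label: str, estimated_distance_m: float,
                   alert_distance_m: float = 2.0) -> bool:
    """Return True if the detection should be tracked across frames."""
    if label in IGNORED_CLASSES:
        return False
    if label in STATIC_CLASSES:
        # Static obstacles only matter when the user walks toward them.
        return estimated_distance_m <= alert_distance_m
    # Unknown or potentially moving objects are always tracked.
    return True

# Example: a chair detected 5 m away is ignored, one 1.5 m away is tracked.
print(needs_tracking("chair", 5.0))   # False
print(needs_tracking("chair", 1.5))   # True
```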
Considering
the hardware restrictions, there is a strategic limitation on the model to be
designed for identifying any object of interest. In previous research,
statistics show this limitation imposed on the images used to train the model, whose quality and resolution range from about 320x240 to 720x480 pixels.
In order to test and train the model, a pre-processing step improves the boundary of each targeted object using a tool named Labelme [24], which performs graphic image annotation. It provides the data needed to detect the desired object, using polygons and tags to classify it. These metadata are stored in a JSON file. An extra step adapts the JSON formatting to YOLO [25], a conversion required before training [26]. The neural network tool used here for object detection is YOLO, chosen for its single-pass approach that ensures speed. There are other popular tools such as Faster R-CNN [27] and SSD [28], but YOLO is simpler to use and more popular, with better community support.
This
project employs PyTorch [29] to implement the object detection model based
on YOLO [25]. PyTorch offers tools for defining neural network
architectures, optimizing models, computing gradients automatically, enabling
distributed training, and more. Additionally, PyTorch
is compatible with execution on either GPU or CPU, providing a valuable option
for users without access to a CUDA-enabled graphics card. Moreover, Ultralytics [30] was utilized for training the neural network, and Pillow [31] for image handling.
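As an illustrative sketch (the model file and image path are placeholders, not the project's actual assets), loading a YOLO segmentation model with Ultralytics and running it on either GPU or CPU could look as follows:

```python
import torch
from ultralytics import YOLO

# Pick GPU if a CUDA-enabled card is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained segmentation model (file name is illustrative).
model = YOLO("yolov8m-seg.pt")

# Run inference on a sample image; the path is a placeholder.
results = model.predict("sample_frame.jpg", device=device, conf=0.5)

for r in results:
    for box in r.boxes:
        # Each detection exposes its class id and confidence score.
        print(model.names[int(box.cls)], float(box.conf))
```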
As a first step toward the multi-model management of the prototype, the models cover a limited diversity of objects. They correspond to those that currently exist in the pre-trained model. This way it is possible to compare their performance with the models newly created with HOLOTECH.
The comparison of both models is a preliminary step to evaluate the degree of precision increment in the processing with models improved with ML. Despite that, it is important to remark that for a tiny subset of targets, the pre-trained models will be used. The option will depend on the performance of the newly created models. In some cases, pre-trained models are not loaded to reduce the overloading of information, reducing the number of patterns and therefore minimizing the firing rate of filtering data tasks.
The selected targets for the current article are “chairs”, “desks”, “doors”, and “persons”, all objects commonly found both indoors and outdoors. As an intermediate procedure, there is an additional validation of the need to use a model based on the classification group. This is because in previous research special cases emerged where the shapes of different obstacles induced mistakes in the patterns. As an example, some chairs and desks look very similar depending on the perspective.
Finally, two special targets are doors and persons. The former because no pre-trained model exists for them in the tool; the latter because persons will be detected and then filtered out. Blind people do not need a disturbing alarm every time a person approaches them, as persons do not represent any collision risk.
III. Architecture Schema
With the ongoing shift in technology, there are two architectural approaches. The first is a standalone Python application that includes the models inside. The second is a client-server architecture, where the client sends the data feed to the server and the server is in charge of processing the data and applying the model.
Each architecture has its pros and cons. For the first one, we found that it is hard to build the binary for the different hardware platforms (Android, iPhone): each platform requires the binary to be built in a specific programming language, and neither is easily adapted to run AI models. Furthermore, whenever the models need to be updated, the application would have to be rebuilt. However, once built, the advantage is that it responds faster and does not depend on external resources.
On the other hand, the second architecture works the other way around. Because it is split into client and server, there are no compatibility issues on the client side, and since the models reside on the server, any update that changes a model does not negatively impact the client. Contrary to the first architecture, this one is also easier to scale in the future. Even so, the disadvantages are that the device must have an internet connection, the server needs to run 24 hours a day, and more users require a more sizable server. All of this implies a significant cost.
For the purpose of this research, we implemented both alternatives, in order to understand at which point one of them stops being viable. Currently, both are valid alternatives.
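As a minimal sketch of the client-server alternative (the endpoint name, port, and model file are assumptions, not the prototype's actual interface), the server side could receive a frame and return the detections:

```python
# Hypothetical server-side sketch for the client-server alternative.
# Endpoint name, port, and model file are illustrative assumptions.
import io

from flask import Flask, request, jsonify
from PIL import Image
from ultralytics import YOLO

app = Flask(__name__)
model = YOLO("yolov8m-seg.pt")  # placeholder model file

@app.route("/detect", methods=["POST"])
def detect():
    # The client posts one video frame as a JPEG in the request body.
    frame = Image.open(io.BytesIO(request.data))
    results = model.predict(frame, conf=0.5)
    detections = [
        {"label": model.names[int(box.cls)],
         "confidence": float(box.conf),
         "bbox": [float(v) for v in box.xyxy[0]]}
        for r in results for box in r.boxes
    ]
    return jsonify(detections)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```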
IV. Prototype Performance and Test Cases
The main
goal of this article is to present the first stage of the tuning process for
the HOLOTECH prototype. The threshold for efficiency metrics has been defined
to ensure a precision rate with minimum quality in real-life functioning. From
the set of pre-trained models, a lower boundary of the acceptance interval is
defined. Note that all performance figures in the current context correspond to tests run against a database of video clips recorded according to the list of selected targets. As mentioned previously, the preferred choices refer to those obstacles declared of interest due to their potential risk during indoor and/or outdoor displacement. The list is the result of interactions with volunteers in collaboration with CINOVI, and consists of both indoor and outdoor obstacles and/or events.
Some of the elements included in the priority list are:
● Cars parked on sidewalks.
● Cars parked in unauthorized spaces.
● Motorcycles parked in unauthorized locations or moving irregularly.
● Bicycles parked in unauthorized locations or moving irregularly.
● Open gates or gates opening onto the street.
● Plant branches or any object protruding into the path of travel, at knee height and above.
● Any public municipal ramp.
● Devices installed on walls, such as air conditioners at non-recommended heights, especially when protruding into the pathway.
● Potholes or breaks in the ground, with or without fencing.
● Potentially dangerous moving obstacles such as strollers, various vehicles, cyclists, etc.
The list continues and might be extended after the initial deployment of the actual prototype, once the performance is tuned within the acceptance-level interval.
V. Model Training and Results
This
section explains the tests performed for making the pattern recognition neural
model with YOLO and depicts some of the essential steps to fit the performance
in YOLO to build a better model. The activity is performed over several successive versions of the model evaluation.
A.
First version: reduced set
The training aims to build a model that outperforms the one included by default in Labelme. The process includes a validation step, an accuracy-level assessment, and a final comparison between the new model and the legacy one. It is worth noting that the training database of videos used for the legacy neural network in the platform is not available to end users.
Therefore a new training set with similar obstacles has to be generated and
labeled in advance. The object classes are named here as ModelGroup,
and each of them has an independent model specifically trained for the subset
of specific objects in it. Considering the reduced list covered in the current
test, the training encompassed four models, each utilizing a set of 150 images.
Every image corresponding to any of the target objects has been conveniently
labeled using Labelme.
In order to extract the necessary information from an image, the tool Labelme is used to streamline the process. The tagging is carried out manually to
ensure that obstacles are accurately identified within the image. Fig. 3 shows
a screenshot of the process.
Fig.
3. Tagging process using Labelme
As can be
seen from the picture, Labelme lets the user define
metadata without the need to use complex image processing concepts to filter unwanted
information and to retrieve the required data. Once an image is tagged, Labelme provides the image context characteristics and the features of the tagged object in JSON format (Fig. 4).
Fig.
4. Example features
automatically generated in JSON format
The JSON
file requires some extra processing to transform it into the required YOLO
format, as the set is then fed to YOLO. This step is performed with a tool
called Labelme2yolo which generates the structure for model training.
Additionally, the tool facilitates the segmentation of images into test, train,
and validation subsets, and if necessary, it can apply pre-processing
techniques to convert the images as in Fig. 5.
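As an illustrative sketch of what this conversion produces (file paths and the class-id mapping are assumptions), a Labelme polygon can be rewritten as a YOLO segmentation label line with normalized coordinates:

```python
# Hypothetical sketch of the Labelme-to-YOLO conversion performed by labelme2yolo.
# File paths and the class-id mapping are illustrative assumptions.
import json
from pathlib import Path

CLASS_IDS = {"chair": 0, "desk": 1, "door": 2}  # assumed class mapping

def labelme_to_yolo_seg(json_path: str, out_dir: str) -> None:
    data = json.loads(Path(json_path).read_text())
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        cls = CLASS_IDS[shape["label"]]
        # YOLO segmentation expects: class_id x1 y1 x2 y2 ... normalized to [0, 1].
        coords = [f"{x / w:.6f} {y / h:.6f}" for x, y in shape["points"]]
        lines.append(f"{cls} " + " ".join(coords))
    out_file = Path(out_dir) / (Path(json_path).stem + ".txt")
    out_file.write_text("\n".join(lines))

labelme_to_yolo_seg("chair_001.json", "labels/train")  # placeholder names
```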
1) Dimension issue
In order to enhance image processing performance, resizing images to standard sizes is implemented: small (100 dpi, or 320x240 pixels), medium (200 dpi, or 500x300 pixels), and large (300 dpi, or 720x480 pixels). Despite potential minor distortion of object shapes, of approximately 1% or less during rescaling, this procedure centers and refines the images by eliminating noise during processing and model development.
To accomplish this resizing process, we devised a compact system capable of obtaining images with their respective names and specified formats. The Python Imaging Library (PIL) was utilized as the tool for resizing. For each size category, a precondition was established: if the image size exceeded a certain threshold, it would be resized to fit within the designated size for that category, and the updated file would be saved with the same name and extension.
Fig.
5. Example of a pre-processing
step
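A minimal sketch of this resizing step with Pillow follows; the size thresholds and directory name are placeholders taken from the categories above, not the exact routine used in the project:

```python
# Hypothetical sketch of the resizing step using Pillow (PIL).
# Size categories and directory names are illustrative assumptions.
from pathlib import Path
from PIL import Image

# Target sizes per category (width, height), in pixels.
SIZE_CATEGORIES = {"small": (320, 240), "medium": (500, 300), "large": (720, 480)}

def resize_in_place(image_path: Path, category: str = "large") -> None:
    """Downscale the image to the category size, keeping its name and extension."""
    max_w, max_h = SIZE_CATEGORIES[category]
    img = Image.open(image_path)
    img.load()  # read pixel data so the source file can be overwritten safely
    if img.width > max_w or img.height > max_h:
        # Resizing to the exact category size may slightly distort object shapes.
        img = img.resize((max_w, max_h))
        img.save(image_path)  # overwrite with the same name and extension

for path in Path("dataset/images").glob("*.jpg"):  # placeholder directory
    resize_in_place(path, "large")
```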
2)
Model
The model training set comprises chair, desk, and door pictures. The premise is to apply pattern matching to similar objects, as mentioned previously, to determine the confusion rate of the model, evaluate the confidence level of the object identification, and limit the failure rate.
The model
training sessions consist of 100 epochs using the yolov8m-seg model type from
YOLO. Training is performed in two independent batches. Due to the use of a Mac
Pro workstation, the training process took several hours to complete. It is
important to note that this is not a problem since the model is not expected to
be retrained during its application in the field. Nevertheless, it is possible
to reduce the training time by utilizing an NVIDIA graphics card in a Linux
environment. The configuration used to train the models is the initial step in
understanding the response of the training and fine-tuning it if necessary to
improve the precision of the model.
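A minimal sketch of this training configuration with Ultralytics follows; the dataset YAML path, batch size, and device are assumptions, while the base model (yolov8m-seg) and the 100 epochs match the description above:

```python
# Hypothetical training sketch matching the reported configuration:
# yolov8m-seg base model, 100 epochs; dataset path and batch size are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8m-seg.pt")
model.train(
    data="modelgroup.yaml",  # placeholder dataset definition (train/val paths, class names)
    epochs=100,
    batch=16,        # assumed batch size
    imgsz=640,       # default YOLO input size
    device="cpu",    # or 0 for the first CUDA GPU
)

# Validate the trained weights and report summary metrics.
metrics = model.val()
print(metrics.box.map50)  # mAP@0.5 over the validation set
```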
To
emphasize the object in the individual models, an additional
"background" tag is added to indicate when an object other than the
expected object was selected. This was done in an attempt to create false
positives and have a clear understanding of the background noise (see an
example of the problem in Fig. 6).
Fig.
6. Example of tagging with
included background
3)
Results
It is
interesting to note that the ModelGroup managed to
find the object's category but failed to correctly identify the specific object
when it was trained with objects sharing similar characteristics, such as a
desk and a chair. The problem is shown in Fig. 7.
Fig.
7. Comparison of the tagged
objects and predicted outcome
The
precision of some identified objects is high enough for the model to be
confident that the object is a desk, with a score of 0.8. However, in some
cases, the model still finds objects incorrectly and determines a wrong
classification with a high confidence score. The problem arises only in cases where the outer shape is similar between two or more patterns (as with the chair and the table in Fig. 7, where the model ends up identifying everything as a desk). One solution for this particular case was to add a boundary restriction stating that a chair must fall within certain dimensions, and the same for the desk. However, this only solves specific cases such as the confusion between chairs and desks; as more diverse objects are added to the model training, the approach will not scale well and the restrictions will become more complex.
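A minimal sketch of such a dimension-based restriction follows; the size ranges and the detection structure are illustrative assumptions, not the prototype's actual rules:

```python
# Hypothetical post-filter rejecting detections whose bounding-box size
# contradicts the predicted label. Pixel ranges are illustrative assumptions.
DIMENSION_RULES = {
    "chair": {"max_width": 300, "max_height": 350},
    "desk":  {"min_width": 250, "min_height": 150},
}

def passes_dimension_rule(label: str, bbox: tuple[float, float, float, float]) -> bool:
    """bbox is (x1, y1, x2, y2); return False when the size contradicts the label."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    rule = DIMENSION_RULES.get(label, {})
    if w > rule.get("max_width", float("inf")) or h > rule.get("max_height", float("inf")):
        return False
    if w < rule.get("min_width", 0) or h < rule.get("min_height", 0):
        return False
    return True

# Example: a detection labeled "chair" but 500 px wide is rejected.
print(passes_dimension_rule("chair", (10, 10, 510, 200)))  # False
```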
It seems
like the models trained individually with the background did not end up being
as precise as expected, as shown in Fig. 8.
Fig.
8. Predicted output (right)
using dedicated models
This
time, the precision of the predicted output was not perfect. As can be seen in
the picture, the model ended up finding multiple readings regarding the objects
identified, with a confidence range between 0.3 to 0.7, and identifying an
object as a background too. The primary problem with the model appears to be
the various interpretations of the recognized objects, with confidence scores
ranging from 0.3 to 0.7. Although the identification of the background as an
object can be eliminated, the consistent trend of readings falling within this
confidence range for the predicted objects is worrisome. This issue renders the
models unsuitable for use, as it could potentially result in hazardous
situations for individuals.
To better
understand the usability of the models, a study was conducted using the output
from the training. This data can be used to statistically understand the models
and fine-tune them to make them more precise. The outputs are presented in Fig.
9 and Fig. 10.
Fig.
9. Results of the group
training model
Fig.
10. Results of the desk
training model
The
addition of the background tagging for the individual trained model was to
reduce the background noise found in the analysis of the confusion matrix
trained on the ModelGroup model Fig. 11.
Fig.
11. Confusion matrix for ModelGroup
Although
the data used to train the model did not contain a background tag, the model
predicted the desk object as background in about 77% of the cases. In order to
reduce the rate of mispredictions, the background was provided during training
for the other models.
The inclusion of the background tag did not result in the expected enhancement in the individual models. This caused the background to be identified as a separate object, as shown in Fig. 12. This indicates that additional modifications or fine-tuning may be required to resolve this problem. Although the misprediction rate was lower than the other models, there were still too many issues with the outcome.
Fig.
12. Confusion matrix of Desk
Model
The predicted
outcome for the desk object was similar for both models, with a prediction of
38% and 36%, respectively.
B.
Second version: optimized set
The training process was conducted similarly to the previous version, but with the distinction that this time, the dataset consisted of 38 curated images primarily focusing on two categories: people and chairs. Given the nature of the objects of interest (people and chairs), a new phase was introduced to identify these objects. In this process, additional images were gathered to augment the dataset for testing purposes. The primary goal is to evaluate the effectiveness of the newly developed code in detecting these specific objects.
Fig.
13. Tagging of persons
1)
Model
The new
model is named ModelGroup, focused on detecting
people and chairs as previously mentioned. This model was trained for 30 epochs
in 2 batches using YOLO. The training was conducted on a workstation running
Windows OS, utilizing CPU processing which led to longer training times (Fig.
14).
2)
Results
Upon
evaluating the results obtained from ModelGroup, we
observed an improvement in detecting people compared to previous training
iterations. However, the model still struggles to detect multiple individuals
accurately. Specifically, in one instance depicted in Figure 15, half of the
group of people was not tagged. The precision of object tagging in these
results achieved a confidence level of 0.6 (see Figs. 16 to 18).
Fig.
14. Tagging of persons
Fig.
15. Comparison between tagged
and predicted persons
Fig.
16. Bar prediction rate for
tagging
Fig.
17. Plot prediction hit for
tagging
Fig.
18. Confusion matrix
Based on figures 16, 17, and 18, an improvement in the results obtained can be seen compared to the previous instance.
C.
Third version: additional objects’ set
For the
third iteration, we followed a similar procedure as before but made a
modification to the dataset by including an additional tagged object,
specifically tables. The objects of interest, now totaling three for this case
and trained simultaneously, presented certain challenges and difficulties in
identifying and tagging each element. The goal here is to assess the
effectiveness of training with a group of three different object types.
As before, the JSON files require the extra processing step to transform them into the required YOLO format before being fed to YOLO.
Fig.
19. Hit rate of the extended
tagged objects
Fig.
20. Cohen metric of the
extended tagged objects
D. Fourth
version: extended set
In this latest test, the samples are only images tagged as containing persons, and the dataset was supplemented with additional images, bringing the total to 271. Since the focus is on a single object in the scenario, the model finds it easier to detect a target. The aim of this version is to concentrate solely on a single object and to observe whether this approach yields better results and detection rates. The training keeps 100 epochs with 6 batches using YOLO, following the same configuration as in previous models to facilitate comparison.
With an
increased number of images and tagged objects to refine the neural network, we
have achieved the best results thus far, with responses and detections
approaching perfection, accurately tagging the full body of individuals.
However, one issue we encountered was with objects or people situated too far
away or overlapping with other elements, as illustrated in Figures 21 and 22.
Despite these complications, it can be stated that with a higher frame rate, results with less noise can be obtained, allowing better control of the situation with this group of individuals and yielding a certainty rate close to 90-95%.
VI. Conclusions
The analysis of the models for identifying different targets shows similarly accurate predictions regardless of the type of detection. Also, the extra work required to filter noise does not provide a significant improvement in the hit rate; since it only concerns objects that represent a low risk for blind people, it can be avoided. Regarding the splitting of a global model into class models, tests show that it is necessary, but models should not be trained for every single object; rather, they should be trained for predetermined classes or ontological groups in order to improve precision. As shown in the tests, there are different ways to implement this, among others: creating models without background tagging, improving the input data set, adding more false positives, and fine-tuning the model.
Fig.
21. Rate plots of the extended
set
Fig.
22. Cohen metric of the
extended set
At the current state of this research it is not possible to reach the same level of precision as the pre-trained commercial models in certain specific cases presented here; therefore, extra work is required. Regarding the architecture implemented in the prototype, tuning also remains to be done for selecting a proper aggregation of objects.
The
evaluation of the models for identifying target objects revealed that whether
using a group of objects or analyzing individual objects, the predictive
outcomes are largely comparable. Nevertheless, superior performance is observed
with individual object analysis, leading to enhanced detection, reduced noise,
and a notable decrease in false positives.
VII. Risk Rules
For the application of the models, and to allow a fast response of the system, we set a couple of rules where, depending on the coordinates of the identified object, the system proceeds to either track it further or ignore it. The concept is that there is no point in following up an object that does not present any foreseeable danger to the user. For that purpose, some constants and variables are set depending on the user and, after detection, the different equations illustrated in Fig. 23 are applied to determine the risk the object poses to the user. The equations use two consecutive detections to identify the velocity, dispersion, and direction of the object.
Fig. 23. Some of the equations to determine the risk
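The exact equations are the ones shown in Fig. 23; the following is only a minimal sketch of the idea, assuming pixel-space bounding-box centers from two consecutive detections and an illustrative risk threshold:

```python
# Hypothetical sketch: estimate velocity and direction from two consecutive
# detections of the same object and flag it as risky. The threshold is an
# assumption; the actual rules are the equations shown in Fig. 23.
import math

def center(bbox):
    """Center (cx, cy) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def assess_risk(bbox_prev, bbox_curr, dt_s, speed_threshold=50.0):
    """Return (speed_px_per_s, direction_deg, is_risky) for two detections dt_s apart."""
    (px, py), (cx, cy) = center(bbox_prev), center(bbox_curr)
    dx, dy = cx - px, cy - py
    speed = math.hypot(dx, dy) / dt_s             # apparent speed in pixels per second
    direction = math.degrees(math.atan2(dy, dx))  # approach angle in the image plane
    is_risky = speed > speed_threshold            # fast-approaching objects get tracked
    return speed, direction, is_risky

# Example: an object moving ~71 px in 0.5 s is flagged as risky.
print(assess_risk((100, 100, 150, 200), (150, 150, 200, 250), dt_s=0.5))
```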
VIII. Future Work
This work aims to enhance the existing models of the project by improving precision and considering a wider range of objects for detection. The resulting models are expected to reach a performance similar to that of any pre-trained commercial model. The next step is to feed the results to an Expert System that interprets the context and uses the correct model in sequence.
Further investigation is required to assess how these models perform with additional targets. The creation of this system will lead to an automatic interpretation of the environment using additional technologies such as GPS or movement sensors, to gain extra knowledge of the individual’s indoor or outdoor environment and to analyze the existence of any risks in the close or faraway surroundings.
Depending on the architecture selected, some of these ideas will be more challenging to apply; however, exploring them will allow us to determine different alternatives or external resources that can help improve the prototype.
The improved prototype will aim to be adequate for
a real-time response and validation of the context where objects move and eventually produce alarm
signals by means of a sound language shared with users for seamless
communication between the prototype and its user.
Incorporating a GPS signal could substantially enhance contextual awareness by identifying whether an individual is indoors or outdoors, and it could leverage specific types of models for this purpose.
References
[1] Park, J.S., De Luise, D.L., Hemanth, D.J., Pérez, J. (2018). Environment Description for Blind People. In: Balas, V., Jain, L., Balas, M. (eds) Soft Computing Applications. SOFA 2016. Advances in Intelligent Systems and Computing, vol 633. Springer, Cham. https://doi.org/10.1007/978-3-319-62521-8_30
[2] Bryant Penrose, R. (2023). Anticipating Potential Barriers for Students With Visual Impairments When Using a Web-Based Instructional Platform. Journal of Visual Impairment & Blindness. Tools of the Blind and Visually Impaired. Volume 117 Issue 5
[3] Curing Retinal Blindness Foundation (2023). Tools of the Blind and Visually Impaired. https://www.crb1.org/for-families/resources/tools
[4] Blasch, B. B., Long, R. G., and Griffin, Shirley N. Results of a National Survey of Electronic Travel Aid Use. Journal of Visual Impairment and Blindness, November, 1989, v. 33, n 9, pp 449-453.
[5] WEBAIM (2021) Screen Reader User Survey #9 Results. Web
accessibility in mind. Institute for Disability Research. Utah State
University. Last updated: Jun 30, 2021. https://webaim.org/projects/screenreadersurvey9/
[6] The Lancet Global Health Commission on Global Eye Health:
vision beyond 2020. Crossref DOI link: https://doi.org/10.1016/S2214-109X(20)30488-5
[7] Cleveland Clinic (2022) Blindness. https://my.clevelandclinic.org/health/diseases/24446-blindness
[8] Understanding Experiences of Blind Individuals in Outdoor Nature. M. Bandukda · A. Singh · N. Bianchi-Berthouze · C. Holloway. DOI: 10.1145/3290607.3313008. Conference: ACM CHI'19 · 2019
[9] A Virtual Environment for People Who Are Blind – A Usability Study. O. Lahav, D W Schloerb, S Kumar, M A Srinivasan. J Assist Technol. 2012; 6(1). doi:10.1108/17549451211214346. 2016
[10] Challenges That Blind People Face. Written by Kate Beck 18 December, 2018. HealthFully (https://healthfully.com/).Leaf Group Ltd.
[11] Insight (Lawrence). Author manuscript; available in PMC 2018 Dec 28. Published in final edited form as: Insight (Lawrence). 2011 Spring; 4(2): 83–91.
[12] Front Psychol. 2022; 13: 897098. Published online 2022 Oct 28. doi: 10.3389/fpsyg.2022.897098
[13] Rasouli Kahaki, Z., Karimi, M., Taherian, M. et al. (2023) Development and validation of a white cane use perceived advantages and disadvantages (WCPAD) questionnaire. BMC Psychol 11, 253. https://doi.org/10.1186/s40359-023-01282-4
[14] Holzer, R. (2019) OpenCV tutorial Documentation. Release 2019. pp 125
[15] Park, N. et al. (2021). Multi-neural Networks Object
Identification. In: Balas, V., Jain, L., Balas, M., Shahbazova, S. (eds)
Soft Computing Applications. SOFA 2018. Advances in Intelligent Systems and
Computing, vol 1222. Springer, Cham. https://doi.org/10.1007/978-3-030-52190-5_13
[16] López De Luise, D., Park Jin , S., Hoferek , S., Avila Lautaro, N., Benitez Micaela, A., Bordon Sbardella, F. R., Fantín, R. I., Machado, G. E., Mencia Aramis, O., Ríos, A. A., Luis, E. L., & Riveros, N. E. (2023). Detección Automática de Objetos como asistencia a Personas Invidentes. Revista Abierta De Informática Aplicada, 7(1), 37–50. https://doi.org/10.59471/raia202356
[17] Furundarena, F., López De Luise, D., Veiga, M. (2022) Computational Creativity through AI modeling. CASE 2022
[18] Komatsu, T., Saito, T. (2006). Color Transformation and Interpolation for Direct Color Imaging with a Color Filter Array. International Conference on Image Processing, pp. 3301-3304, doi: 10.1109/ICIP.2006.312878
[19] Imtiaz, M. S., Wahid, K. A. (2014) Image enhancement and space-variant color reproduction method for endoscopic images using adaptive sigmoid function. In 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3905-3908, doi: 10.1109/EMBC.2014.6944477
[20] Wen, Y. W., Ng, M. K., Huang, Y. M. (2008) Efficient Total Variation Minimization Methods for Color Image Restoration. In IEEE Transactions on Image Processing, vol. 17, no. 11, pp. 2081-2088, doi: 10.1109/TIP.2008.2003406
[21] Kuo, T., Hsieh, C., Lo, Y. (2013) Depth map estimation from a single video sequence. In IEEE International Symposium on Consumer Electronics, pp. 103-104, doi: 10.1109/ISCE.2013.6570130
[22] Yakubenko, M. A., Gashnikov, M. V. (2023) Entropy Modeling in Video Compression Based on Machine Learning. In IX International Conference on Information Technology and Nanotechnology (ITNT), Samara, Russian Federation, pp. 1-4, doi: 10.1109/ITNT57377.2023.10139143
[23] De Siva, N. H. T. M., Rupasingha, R. A. H. M. (2023) Classifying YouTube Videos Based on Their Quality: A Comparative Study of Seven Machine Learning Algorithms. In IEEE 17th International Conference on Industrial and Information Systems (ICIIS), Peradeniya, Sri Lanka, pp. 251-256, doi: 10.1109/ICIIS58898.2023.10253580
[24] Russell, B. C., Torralba, A., Murphy, K. P., Freeman, W. T. (2005). LabelMe: a database and web-based tool for image annotation. MIT AI LAB MEMO AIM-2005-025, SEPTEMBER, 2005
[25] Upulie, H. D. I., Kuganandamurthy, L. (2021). Real-Time Object Detection Using YOLO: A Review. DOI: 10.13140/RG.2.2.24367.66723
[26] labelme2yolo 0.1.3 (2023) Project Description. October 2023
release. https://pypi.org/project/labelme2yolo
[27] Ren, S., He, K., Girshick, R.,
Sun, J. (2016). Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks. Cornell University.
Doi: https://arxiv.org/abs/1506.01497
[28] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., Berg, A.C. (2016). SSD: Single Shot MultiBox Detector. Doi: https://arxiv.org/abs/1512.02325
[29] Pytorch
Foundation, (2016), PyTorch, https://pytorch.org/
[30] Ultralytics (2023), https://github.com/ultralytics/ultralytics
[31] Jeffrey A. Clark. Pillow 10.3.0 (2024). https://pillow.readthedocs.io/en/stable/