[Figure: CV4LV objective overview]

Computer vision towards functional scene understanding:
Unpacking Activities of Daily Living through AI in Low Vision (LV)

Research Motivation:

Decreases in vision and sensory loss constrain one's mobility and subsequently lead to the loss of independence in activities of daily living (ADL), i.e., eating, bathing, dressing, toileting, and transferring, followed by unemployment, reduced quality of life, and functional dependencies that limit psychosocial wellbeing. The status quo for addressing the ADL needs of those with visual impairment (VI) is assistance from either an individual supervisor who provides personal guidance tailored to the person's requirements, or specially designed tools. For example, the most common assistive device for the navigation and mobility of people with low vision is the cane, and a collection of gadgets and utensils with braille labels attached aims to protect blind users from injury during hazardous ADL such as cutting or cooking.

In the current era of versatile portable environmental sensors and agents driven by artificial intelligence (AI) and computer vision (CV) (e.g., self-driving vehicles), there is a clear need to move beyond today's outdated ADL assistive technology toward modern tools that allow the visually impaired to regain the mobility lost to sensory deprivation and to stop the current downward 'spiral' of debility. To this end, the proposed wearable technology solution, Computer Vision for Activities of Daily Living through AI in Low Vision (CV4LV), provides real-time personal safety supervision of one's immediate three-dimensional environment along with action planning and guidance in ADL, restoring environmental perception as well as ADL independence for people with low vision (LV). More specifically, we aim to 1) design a wearable platform that applies multi-sensor fusion techniques to effectively combine information from the infrared, ultrasound, and stereo-camera sensor systems (hardware) newly embedded in the CV4LV smart service platform; 2) develop algorithms that perform functional scene understanding (i.e., semantic and affordance scene parsing) and cyber-human interaction learning (i.e., decoding the relations between human activities and objects in an unconstrained environment); and 3) support cyber-human interaction for the ADL of the LV by providing assistive services that operate in a 'Default' mode or a 'User-selective' mode to meet different user requirements.

Research Objectives

Objective 1: Design the optimal Computer Vision for Activities of Daily Living through AI in Low Vision (CV4LV) platform

The proposed project aims to create an enhanced mobility device, the CV4LV platform, that marries enhanced multi-modal input with advanced vibrotactile and auditory human-computer interfaces. Our first objective will therefore involve integrating the stereo camera and infrared range sensor into our original wearable vest, paying particular attention to maintaining usability as a pull-over and to avoiding coverage of any of the currently integrated ultrasound sensors (shown below). In addition, we will integrate a wireless bone-conduction headset as the feedback element of the platform (providing auditory as well as vibrotactile feedback channels), connecting this device to the auditory alert system in development and to the navigation system that will be part of the final design. The wearable device consists of four main components (shown in the figure below): (1) a wearable vest with distinct range and image sensors embedded, tracking the surrounding scene information in order to make decisions; alerts about obstacles and potential hazards are conveyed through (2) a haptic interface (belt) that communicates spatial information to the end user in real time via an intuitive, ergonomic, and personalized vibrotactile re-display along the torso; (3) a smartphone that serves as a connectivity gateway and coordinates the core components through WiFi, Bluetooth, and/or 4G LTE; and (4) a headset that contains both binaural bone-conduction speakers (leaving the ear canal open for ambient sounds) and a microphone for voice recognition during use of a virtual personal assistant (VPA) [under development].

[Figure: Proposed CV4LV platform]
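As a minimal illustration of how the vest's heterogeneous sensor streams could feed the haptic and audio channels, the following Python sketch shows one cycle of a sensing-and-feedback loop. All names (SensorReading, ObstacleAlert, fuse_readings, the 15-degree binning, and the 2 m alert range) are illustrative placeholders, not the platform's actual API or parameters.

```python
# Minimal sketch of the CV4LV sensing/feedback loop; all class and method
# names are illustrative placeholders, not the actual platform API.
from dataclasses import dataclass
from typing import List

@dataclass
class SensorReading:
    sensor: str         # "ultrasound" | "infrared" | "stereo"
    bearing_deg: float  # direction of the detection relative to the wearer
    range_m: float      # estimated distance in meters
    confidence: float   # 0..1, sensor-specific reliability

@dataclass
class ObstacleAlert:
    bearing_deg: float
    range_m: float
    severity: float     # 0..1, drives vibration intensity / audio urgency

def fuse_readings(readings: List[SensorReading], alert_range_m: float = 2.0) -> List[ObstacleAlert]:
    """Confidence-weighted fusion of per-sensor detections into alerts."""
    alerts = []
    bins = {}
    # Group detections that point in roughly the same direction (15-degree bins).
    for r in readings:
        bins.setdefault(round(r.bearing_deg / 15.0), []).append(r)
    for group in bins.values():
        w = sum(r.confidence for r in group)
        if w == 0:
            continue
        rng = sum(r.range_m * r.confidence for r in group) / w
        brg = sum(r.bearing_deg * r.confidence for r in group) / w
        if rng < alert_range_m:
            alerts.append(ObstacleAlert(brg, rng, severity=1.0 - rng / alert_range_m))
    return alerts

# Usage: one cycle of the real-time loop with two concurring detections.
readings = [
    SensorReading("ultrasound", bearing_deg=10.0, range_m=1.2, confidence=0.8),
    SensorReading("stereo",     bearing_deg=12.0, range_m=1.1, confidence=0.9),
]
for alert in fuse_readings(readings):
    print(f"Vibrate belt at {alert.bearing_deg:.0f} deg, severity {alert.severity:.2f}")
```

In this sketch, agreeing detections from different sensors reinforce a single alert whose severity scales with proximity; the actual fusion strategy on the platform would be learned rather than hand-weighted.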

Objective 2: Design deep adversarial neural networks for functional scene understanding

Task 1: Real-time 3D dynamic scene reconstruction.

Our proposed real-time dense reconstruction of a dynamic scene provides the following primary contributions beyond the state of the art: (1) it uses a purely point-based representation throughout the reconstruction process within a robot, and a lower-dimensional, feature-based approximate representation for efficient information exchange among robots; (2) it employs a Gaussian mixture model (GMM) to describe the distribution of samples from previous frames, unifying RGB-D camera pose estimation and dynamic-object detection; in the proposed GMM-based approach, the camera pose estimation problem is converted into minimization of the L2 distance between Gaussian mixtures, leading to a computationally efficient registration algorithm; and (3) its incremental update strategy makes the method naturally applicable to large-scale dynamic scene reconstruction. The main components are shown below.

[Figure: 3D reconstruction pipeline]
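To make the L2-distance registration idea concrete, the sketch below estimates a rigid camera motion by modeling two point sets as isotropic, equally weighted Gaussian mixtures and minimizing the closed-form L2 distance between them. The bandwidth, the Powell optimizer, and the toy point clouds are assumptions for illustration; the project's actual pipeline (dense point-based fusion, dynamic detection, incremental updates) is not reproduced here.

```python
# Sketch of GMM-based rigid registration: minimize the closed-form L2
# distance between two isotropic Gaussian mixtures over a 6-DoF pose.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def gauss_cross_term(A, B, sigma2):
    """(1/NM) * sum_ij N(a_i; b_j, 2*sigma2*I) for equally weighted mixtures."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    dim = A.shape[1]
    norm = (4.0 * np.pi * sigma2) ** (-dim / 2.0)
    return norm * np.exp(-d2 / (4.0 * sigma2)).sum() / (A.shape[0] * B.shape[0])

def l2_distance(model, scene, sigma2):
    """Closed-form L2 distance between the two mixtures."""
    return (gauss_cross_term(model, model, sigma2)
            - 2.0 * gauss_cross_term(model, scene, sigma2)
            + gauss_cross_term(scene, scene, sigma2))

def register(model, scene, sigma2=0.02):
    """Find the rotation/translation that aligns 'scene' to 'model'."""
    def cost(params):
        R = Rotation.from_rotvec(params[:3]).as_matrix()
        return l2_distance(model, scene @ R.T + params[3:], sigma2)
    res = minimize(cost, np.zeros(6), method="Powell")
    return Rotation.from_rotvec(res.x[:3]).as_matrix(), res.x[3:]

# Toy check: recover a small known camera motion between two frames.
rng = np.random.default_rng(0)
prev_frame = rng.uniform(-1, 1, size=(200, 3))
R_true = Rotation.from_rotvec([0.0, 0.05, 0.02]).as_matrix()
t_true = np.array([0.10, -0.05, 0.02])
curr_frame = prev_frame @ R_true.T + t_true
R_est, t_est = register(prev_frame, curr_frame)
aligned = curr_frame @ R_est.T + t_est
print("alignment RMSE:", np.sqrt(((aligned - prev_frame) ** 2).sum(axis=1).mean()))
```

The self-terms of the mixtures are invariant under a rigid transform, so only the cross term actually drives the optimization; a production implementation would exploit this and use a gradient-based solver.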

Task 2: 3D semantic and affordance scene parsing.

To address a limitation of traditional scene parsing, namely insufficient understanding of object functionality and of the potential actions that could be applied to objects, we propose a novel 3D scene parsing model for the CV4LV project that performs both semantic segmentation and affordance detection. An example of semantic segmentation and affordance detection is shown below.

[Figure: Example of semantic segmentation and affordance detection]
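As a minimal sketch of the joint-prediction idea, the PyTorch model below shares one encoder and decoder between a semantic head and an affordance head, each producing per-pixel labels. The layer sizes, class counts, and input resolution are assumptions, and the adversarial training and 3D inputs described in this objective are deliberately omitted.

```python
# Minimal two-head network: shared features, separate per-pixel outputs
# for semantic classes and affordance classes.
import torch
import torch.nn as nn

class JointSceneParser(nn.Module):
    def __init__(self, n_semantic=20, n_affordance=8):
        super().__init__()
        # Shared convolutional encoder over an RGB frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder upsamples back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        # One 1x1 head per task.
        self.semantic_head = nn.Conv2d(32, n_semantic, 1)
        self.affordance_head = nn.Conv2d(32, n_affordance, 1)

    def forward(self, x):
        feat = self.decoder(self.encoder(x))
        return self.semantic_head(feat), self.affordance_head(feat)

# Usage with a dummy frame and a joint per-pixel cross-entropy loss.
model = JointSceneParser()
frame = torch.randn(1, 3, 128, 160)
sem_logits, aff_logits = model(frame)
sem_target = torch.randint(0, 20, (1, 128, 160))
aff_target = torch.randint(0, 8, (1, 128, 160))
loss = nn.CrossEntropyLoss()(sem_logits, sem_target) + nn.CrossEntropyLoss()(aff_logits, aff_target)
loss.backward()
print(sem_logits.shape, aff_logits.shape)
```

Sharing the backbone lets the affordance head benefit from semantic features (and vice versa), which is the motivation for predicting both maps jointly rather than with two separate networks.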

Objective 3: Design recurrent neural networks and deep reinforcement learning for cyber-human interaction understanding

Task 1: Human daily activity learning through LSTM

To learn and understand the sub-activity sequence involved in a specific high-level activity, we propose a model based on a many-to-many LSTM that interprets human actions from video frames over time. Each LSTM module consists of a memory cell and a number of input and output gates that control the information flow through the sequence and prevent the loss of important information. The figure below illustrates learning the action sequence of making cereal with the LSTM.

[Figure: Learning the action sequence of making cereal with an LSTM]
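A minimal many-to-many formulation is sketched below: one frame-level feature vector in, one sub-activity label out per time step. The feature dimension, hidden size, and the sub-activity vocabulary ("reach", "pour", ...) are illustrative stand-ins for the cereal-making example above, not the project's actual label set.

```python
# Many-to-many LSTM: per-frame features in, per-frame sub-activity labels out.
import torch
import torch.nn as nn

SUB_ACTIVITIES = ["reach", "move", "pour", "place", "eat", "null"]

class SubActivityLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=128, n_classes=len(SUB_ACTIVITIES)):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, frame_feats):            # (batch, time, feat_dim)
        h, _ = self.lstm(frame_feats)          # hidden state at every time step
        return self.classifier(h)              # (batch, time, n_classes)

# Usage: label every frame of a 30-step clip and train with a per-frame loss.
model = SubActivityLSTM()
clip = torch.randn(2, 30, 512)                 # e.g. CNN features per frame
logits = model(clip)
targets = torch.randint(0, len(SUB_ACTIVITIES), (2, 30))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, len(SUB_ACTIVITIES)), targets.reshape(-1))
loss.backward()
print([SUB_ACTIVITIES[i] for i in logits[0].argmax(-1)[:5]])
```

Because a label is emitted at every time step, the model can segment a long activity into its ordered sub-activities online, which is what the downstream decision module consumes.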

Task 2: Reinforcement learning for action decision strategy learning

With the sub-activities learned by the LSTM model of human daily activity, and the semantic and affordance images learned by the scene parsing networks, a deep reinforcement learning machine (DRLM) will be introduced to gather all of the extracted information for intelligent decision making and action planning. The pipeline is shown in the figure below: (1) the upper-left blue box indicates the scene parsing network, which determines the state and environmental observation from the semantic image and affordance information; (2) the upper-right ivory box shows the LSTM model, which extracts the sequence of activities and defines actions for further learning; and (3) the green box consists of two neural networks that approximate Q-values and the policy, select the optimal action based on the current state, and interact with the end user through an audio input/output device.

[Figure: Deep reinforcement learning pipeline for decision making and action planning]
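The sketch below shows one decision step of the green box under simple assumptions: the state concatenates features from the scene parsing network and the LSTM activity model, a critic network scores Q-values, an actor network outputs a policy over a small discrete set of assistive actions, and a say() stub stands in for the audio channel. Dimensions, the action list, and all names are hypothetical.

```python
# Actor-critic sketch of the decision module: Q-value network + policy network
# over discrete assistive actions, fed by scene and activity features.
import torch
import torch.nn as nn

ACTIONS = ["announce_obstacle", "guide_left", "guide_right", "suggest_next_step", "stay_silent"]

class Critic(nn.Module):                        # approximates Q-values
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
    def forward(self, s):
        return self.net(s)

class Actor(nn.Module):                         # approximates the policy
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

def build_state(semantic_feat, affordance_feat, activity_feat):
    """State = scene-parsing features + affordance features + LSTM activity features."""
    return torch.cat([semantic_feat, affordance_feat, activity_feat], dim=-1)

def say(text):                                  # stand-in for the audio output channel
    print("AUDIO:", text)

# One decision step with dummy upstream features.
state = build_state(torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 32))
actor, critic = Actor(160, len(ACTIONS)), Critic(160, len(ACTIONS))
probs = actor(state)
q_values = critic(state)
idx = int(torch.argmax(probs))
say(f"selected action: {ACTIONS[idx]} (Q={q_values[0, idx].item():.2f})")
```

Training the two networks (e.g., with task rewards for completed ADL steps and penalties for hazards) is omitted here; the sketch only fixes the interfaces between the upstream perception modules and the action selector.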

Objective 4: Design cyber-human interaction studies via CV4LV with ADL cases

Cyber-human interaction is a communication process between a human and the surrounding environment in which appropriate actions are performed on selected objects to achieve desired effects. In the proposed CV4LV platform, the interrelationship among actions A, objects O, and effects E is explored via the environmental sensors (described in Objective 1) and deep neural networks (described in Objectives 2 and 3), and is then utilized to assist the ADL of the LV by predicting the effects of an action on objects, by planning actions based on object affordances to achieve a specific goal, or by selecting the object that produces a certain effect when acted upon in a certain way. We are interested in knowing whether 1) given desired effects E (e.g., washed dishes) and objects O (dishes), our CV4LV platform can plan the actions that achieve the goal with a reward; 2) given desired effects E and actions A, our CV4LV platform can identify the objects O as the goal; and 3) our CV4LV platform can predict the potential effects (e.g., a cut finger) given the objects (e.g., a knife) and actions (e.g., cutting). Our CV4LV platform will operate in two different modes, 'Default Mode' and 'User-selective Mode', depending on whether the effects are predicted by the system or given by the user. Under Default Mode, the system actively predicts potential effects while current actions interact with objects in the scene; the predicted effects (e.g., tripping and falling) therefore provide a hazard-alert mechanism for the user. Under User-selective Mode, the system operates on a given task, planning actions in order to achieve the user-desired effects (e.g., making a meal). The planned actions are conveyed to the end user through audio.
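The following sketch makes the A-O-E queries and the two operating modes concrete. The effect table and the canned plan are toy placeholders standing in for the learned models of Objectives 2 and 3; the function and entry names are hypothetical.

```python
# Illustrative action/object/effect (A-O-E) queries and the two modes.
EFFECT_TABLE = {
    ("cut", "knife"):    "finger_cut_risk",
    ("grasp", "kettle"): "burn_risk",
    ("step", "cable"):   "trip_and_fall_risk",
}

def predict_effect(action, obj):
    """Default Mode query: predict the effect of the current action on an object."""
    return EFFECT_TABLE.get((action, obj), "no_known_hazard")

def plan_actions(desired_effect, obj):
    """User-selective Mode query: plan actions that achieve a user-given effect
    on an object (here a canned toy plan)."""
    if desired_effect == "clean_dishes" and obj == "dishes":
        return ["locate_sink", "turn_on_water", "apply_soap", "scrub", "rinse"]
    return []

def run(mode, **kwargs):
    if mode == "default":               # system-initiated hazard alerts
        effect = predict_effect(kwargs["action"], kwargs["object"])
        if effect != "no_known_hazard":
            print("AUDIO/HAPTIC ALERT:", effect)
    elif mode == "user_selective":      # user-initiated task guidance
        for step in plan_actions(kwargs["effect"], kwargs["object"]):
            print("AUDIO GUIDANCE:", step)

run("default", action="cut", object="knife")
run("user_selective", effect="clean_dishes", object="dishes")
```

In the actual platform, the lookup table would be replaced by the affordance-aware scene parser and the DRLM, but the query structure (effect prediction in Default Mode, action planning in User-selective Mode) is the same.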


Project Management Plan

The project management plan is shown in the figure below. We will hold one project-specific lab meeting per week, alternating between NYULMC and NYU Tandon, with web conferencing set up for off-site participants. Graduate students will also join the weekly 1VMIL and Visual Computing Lab meetings. We will have in-person meetings once per quarter at NYULMC and hold open office hours for our graduate students at any time.

[Figure: Project management plan]