2024-04-23

LOG: Active Model Adaptation for Label-Efficient OOD Generalization

Accepted by: NeurIPS-2022

Presenter: Fengchun Qiao

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

This work discusses how to achieve worst-case Out-Of-Distribution (OOD) generalization for a variety of distributions at a relatively small labeling cost. The problem has broad applications, especially in non-i.i.d. open-world scenarios. Previous studies either rely on a large amount of labeling or lack guarantees on worst-case generalization. In this work, we show for the first time that active model adaptation can achieve both good performance and robustness based on the invariant risk minimization principle. We propose LOG, an interactive model adaptation framework with two sub-modules: active sample selection and causal invariant learning. Specifically, we formulate active selection as a mixture distribution separation problem and present an unbiased estimator, which can find the samples that violate the current invariant relationship, with a provable guarantee. The theoretical analysis shows that both sub-modules contribute to generalization, and extensive experimental results confirm the promising performance of the new algorithm.

2024-04-09

A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering

Submitted to: arXiv-2024

Presenter: Qitong Wang

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA). Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate not just recognition of visual elements, but also a deep comprehension of the visual information in conjunction with a vast repository of learned knowledge. To uncover such capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect to general knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images, showcasing its proficiency across various specialized fields; 3) Comprehensive Knowledge with Decision-making Rationales, which examines the model's capability to provide logical explanations for its inference, facilitating a deeper analysis from the interpretability perspective. Extensive experiments indicate that GPT-4V achieves SOTA performance on the above three tasks. Interestingly, we find that: a) GPT-4V demonstrates enhanced reasoning and explanation when using composite images as few-shot examples; b) GPT-4V produces severe hallucinations when dealing with world knowledge, highlighting the need for future advancements in this research direction.

2024-04-02

Improving LMMs from the Data Perspective

Presenter: Jeffrey Peng

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

Related Paper 1: Submitted to arXiv, 2024.

Related Paper 2: Accepted by CVPR, 2024.

While LISA effectively bridges the gap between segmentation and large language models to enable reasoning segmentation, it has certain limitations: it cannot distinguish different instances of the target region and is constrained by pre-defined textual response formats. In this work, we introduce LISA++, an update to the existing LISA model, focusing on improving core functionalities while keeping the base architecture intact. The main enhancements in LISA++ include: 1) Enhanced Segmentation: instance segmentation ability has been added, providing more detailed scene analysis along with the existing multi-region semantic segmentation. 2) More Natural Conversation: improved capability for multi-turn dialogue, with the ability to incorporate segmentation results directly into text responses, i.e., Segmentation in Dialogue (SiD). These improvements are achieved by curating existing samples of generic segmentation datasets, aimed specifically at enhancing the segmentation and conversational skills without structural changes or additional data sources. Comparative analysis with the original LISA model shows significant advancements in these areas, positioning LISA++ as a notable upgrade in visual understanding and interaction. LISA++'s adaptability and improved features highlight the versatility of the mask-as-embedding paradigm proposed by LISA and its potential as a foundational model for diverse applications.
Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches, but under false premise conditions produces relative cIOU improvements of more than 31% over baselines, and produces natural language feedback judged helpful up to 67% of the time.

2024-03-26

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Accepted by: ICLR-2024 (Spotlight)

Presenter: Tang Li

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Code: https://github.com/wusize/CLIPSelf

Slides: link

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of the CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
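
For intuition, here is a minimal sketch of a CLIPSelf-style self-distillation loss; the pooling choice, tensor shapes, and function name are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clipself_loss(dense_features, crop_embeddings, boxes):
    """Sketch of a CLIPSelf-style self-distillation objective (our simplification):
    a region embedding pooled from the ViT's dense feature map is pulled toward
    the image-level CLIP embedding of the corresponding image crop.
    dense_features: (C, H, W); boxes: (x0, y0, x1, y1) in feature-grid coordinates."""
    losses = []
    for (x0, y0, x1, y1), crop_emb in zip(boxes, crop_embeddings):
        region = dense_features[:, y0:y1, x0:x1].mean(dim=(1, 2))  # pooled region embedding
        losses.append(1.0 - F.cosine_similarity(region, crop_emb, dim=0))
    return torch.stack(losses).mean()

# Toy shapes: a 512-d dense feature map on a 14x14 grid and two "crop" embeddings.
dense = torch.randn(512, 14, 14, requires_grad=True)
crops = torch.randn(2, 512)  # stand-ins for image-level CLIP embeddings of the crops
loss = clipself_loss(dense, crops, boxes=[(0, 0, 7, 7), (7, 7, 14, 14)])
loss.backward()
```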

2024-03-12

Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

Accepted by: npj Digital Medicine-2021

Presenter: Ricardo Santos

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

Deep learning (DL)-based predictive models from electronic health records (EHRs) deliver impressive performance in many clinical tasks. Large training cohorts, however, are often required by these models to achieve high accuracy, hindering the adoption of DL-based models in scenarios with limited training data. Recently, bidirectional encoder representations from transformers (BERT) and related models have achieved tremendous successes in the natural language processing domain. The pretraining of BERT on a very large training corpus generates contextualized embeddings that can boost the performance of models trained on smaller datasets. Inspired by BERT, we propose Med-BERT, which adapts the BERT framework originally developed for the text domain to the structured EHR domain. Med-BERT is a contextualized embedding model pretrained on a structured EHR dataset of 28,490,650 patients. Fine-tuning experiments showed that Med-BERT substantially improves the prediction accuracy, boosting the area under the receiver operating characteristic curve (AUC) by 1.21–6.14% in two disease prediction tasks from two clinical databases. In particular, pretrained Med-BERT obtains promising performance on tasks with small fine-tuning training sets and can boost the AUC by more than 20%, or obtain an AUC as high as a model trained on a training set ten times larger, compared with deep learning models without Med-BERT. We believe that Med-BERT will benefit disease prediction studies with small local training datasets, reduce data collection expenses, and accelerate the pace of artificial intelligence-aided healthcare.

2024-02-27

Combining Diverse Feature Priors

Accepted by: ICML-2022

Presenter: Fengchun Qiao

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

To improve model generalization, model designers often restrict the features that their models use, either implicitly or explicitly. In this work, we explore the design space of leveraging such feature priors by viewing them as distinct perspectives on the data. Specifically, we find that models trained with diverse sets of explicit feature priors have less overlapping failure modes, and can thus be combined more effectively. Moreover, we demonstrate that jointly training such models on additional (unlabeled) data allows them to correct each other’s mistakes, which, in turn, leads to better generalization and resilience to spurious correlations.

2024-02-08

Interpreting CLIP's Image Representation via Text-Based Decomposition

Accepted by: ICLR-2024 (Oral)

Presenter: Tang Li

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/99817644758

Project Page: https://yossigandelsman.github.io/clip_decomposition/

Code: https://github.com/yossigandelsman/clip_prs

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g., location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that scalable understanding of transformer models is attainable and can be used to repair and improve models.

2024-01-23

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Submitted to: arXiv-2021

Presenter: Kien Nguyen

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/97743603309

Project Page: https://people.eecs.berkeley.edu/~angelopoulos/blog/posts/gentle-intro/

Code: https://github.com/aangelopoulos/conformal-prediction

Black-box machine learning models are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Conformal prediction is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions. One can use conformal prediction with any pre-trained model, such as a neural network, to produce sets that are guaranteed to contain the ground truth with a user-specified probability, such as 90%. It is easy to understand, easy to use, and general, applying naturally to problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is intended to provide the reader with a working understanding of conformal prediction and related distribution-free uncertainty quantification techniques in one self-contained document. We lead the reader through practical theory for and examples of conformal prediction and describe its extensions to complex machine learning tasks involving structured outputs, distribution shift, time series, outliers, models that abstain, and more. Throughout, there are many explanatory illustrations, examples, and code samples in Python. With each code sample comes a Jupyter notebook implementing the method on a real-data example; the notebooks can be accessed and easily run using our codebase.
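
As a concrete illustration of the split-conformal recipe the tutorial covers, below is a small sketch (our own, not taken from the accompanying notebooks); the nonconformity score and variable names are assumptions.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split-conformal prediction sets that contain the true label with
    probability >= 1 - alpha, assuming exchangeable calibration/test data."""
    n = len(cal_labels)
    # Nonconformity score: 1 - softmax probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, q_level, method="higher")
    # Keep every class whose score falls below the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

# Toy usage with random "model" probabilities.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=500)
cal_labels = rng.integers(0, 5, size=500)
test_probs = rng.dirichlet(np.ones(5), size=3)
print(conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1))
```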

2023-12-12

FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding

Accepted by: CVPR-2020

Presenter: Ziyang Jia

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/98953560317

Slides: link

On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g., sports analysis, which require the capability of parsing an activity into phases and differentiating between subtly different actions, their performance remains far from satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g., how to parse the temporal structures from a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset can advance research towards action understanding.

2023-11-28

Explain Any Concept: Segment Anything Meets Concept-Based Explanation

Accepted by: NeurIPS-2023

Presenter: Qitong Wang

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/98953560317

Slides: link

Code: https://github.com/Jerry00917/samshap/tree/main

EXplainable AI (XAI) is an essential topic to improve human understanding of deep neural networks (DNNs) given their black-box internals. For computer vision tasks, mainstream pixel-based XAI methods explain DNN decisions by identifying important pixels, while emerging concept-based XAI methods explore forming explanations with concepts (e.g., a head in an image). However, pixels are generally hard to interpret and sensitive to the imprecision of XAI methods, whereas "concepts" in prior works require human annotation or are limited to pre-defined concept sets. On the other hand, driven by large-scale pre-training, the Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework for performing precise and comprehensive instance segmentation, enabling automatic preparation of concept sets from a given image. This paper for the first time explores using SAM to augment concept-based XAI. We offer an effective and flexible concept-based explanation method, namely Explain Any Concept (EAC), which explains DNN decisions with any concept. While SAM is highly effective and offers an "out-of-the-box" instance segmentation, it is costly when being integrated into de facto XAI pipelines. We thus propose a lightweight per-input equivalent (PIE) scheme, enabling efficient explanation with a surrogate model. Our evaluation over two popular datasets (ImageNet and COCO) illustrates the highly encouraging performance of EAC over commonly-used XAI methods.

2023-10-24

Explaining machine learning models with interactive natural language conversations using TalkToModel

Accepted by: Nature Machine Intelligence-2023

Presenter: Meng Ma

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Practitioners increasingly use machine learning (ML) models, yet models have become more complex and harder to understand. To understand complex models, researchers have proposed techniques to explain model predictions. However, practitioners struggle to use explainability methods because they do not know which explanation to choose and how to interpret the explanation. Here we address the challenge of using explainability methods by proposing TalkToModel: an interactive dialogue system that explains ML models through natural language conversations. TalkToModel consists of three components: an adaptive dialogue engine that interprets natural language and generates meaningful responses; an execution component that constructs the explanations used in the conversation; and a conversational interface. In real-world evaluations, 73% of healthcare workers agreed they would use TalkToModel over existing systems for understanding a disease prediction model, and 85% of ML professionals agreed TalkToModel was easier to use, demonstrating that TalkToModel is highly effective for model explainability.

2023-10-17

Resolving Interference When Merging Models

Accepted by: NeurIPS-2023

Presenter: Fengchun Qiao

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Slides: link

Transfer learning - i.e., further fine-tuning a pre-trained model on a downstream task - can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model merging techniques have emerged as a solution to combine multiple task-specific models into a single multitask model without performing additional training. However, existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. In this paper, we demonstrate that prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter's values across models. To address this, we propose our method, TrIm, Elect Sign & Merge (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign. We find that TIES-Merging outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters and highlight the importance of resolving sign interference.
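
The three steps lend themselves to a compact sketch. The toy implementation below operates on flat task vectors (fine-tuned minus pre-trained weights) and is our own simplification of trim / elect sign / merge, not the authors' released code.

```python
import numpy as np

def ties_merge(task_vectors, keep_frac=0.2):
    """Toy TIES-style merge: trim small updates, elect a per-parameter sign,
    then average only the entries that agree with the elected sign."""
    trimmed = []
    for tv in task_vectors:
        tv = tv.copy()
        cutoff = np.quantile(np.abs(tv), 1.0 - keep_frac)
        tv[np.abs(tv) < cutoff] = 0.0          # 1) trim small-magnitude changes
        trimmed.append(tv)
    stacked = np.stack(trimmed)
    elected = np.sign(stacked.sum(axis=0))     # 2) elect the dominant sign per parameter
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts  # 3) merge agreeing entries

task_vectors = [np.random.randn(10) for _ in range(3)]
print(ties_merge(task_vectors))
```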

2023-10-10

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Accepted by: ICCV-2023

Presenter: Kien Nguyen

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Slides: link

Project Page: https://whoops-benchmark.github.io/

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities.

2023-09-19

GFPose: Learning 3D Human Pose Prior with Gradient Fields

Accepted by: CVPR-2023

Presenter: Ziyang Jia

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Project Page: https://sites.google.com/view/gfpose/

Learning a 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite its simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on the Human3.6M dataset; 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone; 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion, and generation tasks.

2023-09-12

Disentangling visual and written concepts in CLIP

Accepted by: CVPR-2022

Presenter: Tang Li

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

The CLIP network measures the similarity between natural text and images; in this work, we investigate the entanglement of the representation of word images and natural images in its image encoder. First, we find that the image encoder has an ability to match word images with natural images of scenes described by those words. This is consistent with previous research that suggests that the meaning and the spelling of a word might be entangled deep within the network. On the other hand, we also find that CLIP has a strong ability to match nonsense words, suggesting that processing of letters is separated from processing of their meaning. To explicitly determine whether the spelling capability of CLIP is separable, we devise a procedure for identifying representation subspaces that selectively isolate or eliminate spelling capabilities. We benchmark our methods against a range of retrieval tasks, and we also test them by measuring the appearance of text in CLIP-guided generated images. We find that our methods are able to cleanly separate spelling capabilities of CLIP from the visual processing of natural images.

2023-09-05

X-Pruner: eXplainable Pruning for Vision Transformers

Accepted by: CVPR-2023

Presenter: Qitong Wang

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Slides: link

Recently, vision transformer models have become prominent for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, overlooking the relationship between internal units of the model and the target class and thereby leading to inferior performance. To alleviate this problem, we propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion. Specifically, to measure each prunable unit's contribution to predicting each target class, a novel explainability-aware mask is proposed and learned in an end-to-end manner. Then, to preserve the most informative units and learn the layer-wise pruning rate, we adaptively search the layer-wise threshold that differentiates between unpruned and pruned units based on their explainability-aware mask values. To verify and evaluate our method, we apply X-Pruner to representative transformer models including DeiT and the Swin Transformer. Comprehensive simulation results demonstrate that the proposed X-Pruner outperforms the state-of-the-art black-box methods with significantly reduced computational costs and slight performance degradation.

2023-08-29

Personalized Federated Learning with Inferred Collaboration Graphs

Accepted by: ICML-2023

Presenter: Meng Ma

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Code: https://github.com/MediaBrain-SJTU/pFedGraph

Personalized federated learning (FL) aims to collaboratively train a personalized model for each client. Previous methods do not adaptively determine whom to collaborate with at a fine-grained level, making it difficult for them to handle diverse levels of data heterogeneity and cases where malicious clients exist. To address this issue, our core idea is to learn a collaboration graph, which models the benefits from each pairwise collaboration and allocates appropriate collaboration strengths. Based on this, we propose a novel personalized FL algorithm, pFedGraph, which consists of two key modules: (1) inferring the collaboration graph based on pairwise model similarity and dataset size at the server to promote fine-grained collaboration, and (2) optimizing the local model with the assistance of the aggregated model at the client to promote personalization. The advantage of pFedGraph is that it flexibly adapts to diverse data heterogeneity levels and model poisoning attacks, as the proposed collaboration graph always pushes each client to collaborate more with similar and beneficial clients. Extensive experiments show that pFedGraph consistently outperforms the other baseline methods across various heterogeneity levels and multiple cases where malicious clients exist.

2023-08-22

"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts

Accepted by: ICML-2023

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/94110688939

Machine learning models frequently experience performance drops under distribution shifts. The underlying cause of such shifts may be multiple simultaneous factors such as changes in data quality, differences in specific covariate distributions, or changes in the relationship between label and features. When a model does fail during deployment, attributing performance change to these factors is critical for the model developer to identify the root cause and take mitigating actions. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
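
Since the attribution reduces to Shapley values over sets of candidate distributions, a brute-force sketch helps fix ideas. The value function below is a toy stand-in; how one would estimate it in practice (e.g., via the paper's importance-weighting method) is left abstract.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a small set of players (here, the individual
    distributions whose shift may explain a performance change). value(S)
    returns the performance change when only the distributions in S shift."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi

# Toy value function: the covariate shift explains most of the performance drop.
drop = {"covariates": -4.0, "label|covariates": -1.0}
value = lambda S: sum(drop[p] for p in S)
print(shapley_values(list(drop), value))
```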

2023-08-15

Concept Bottleneck Models

Accepted by: ICML-2020

Presenter: Kien Nguyen

Time: 4:00-5:30 p.m. EDT

Slides: link

We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction. On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models, while enabling interpretation in terms of high-level clinical concepts ("bone spurs") or bird attributes ("wing color"). These models also allow for richer human-model interaction: accuracy improves significantly if we can correct model mistakes on concepts at test time.
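
A minimal sketch of the bottleneck idea, with toy dimensions and module names of our own choosing: the label is predicted only from the predicted concepts, so concepts can be edited at test time.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Toy concept-bottleneck model: input -> concept predictions -> label."""
    def __init__(self, in_dim, n_concepts, n_classes):
        super().__init__()
        self.concept_net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                         nn.Linear(64, n_concepts))
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x, concept_override=None):
        concepts = torch.sigmoid(self.concept_net(x))
        if concept_override is not None:
            concepts = concept_override  # test-time intervention on concepts
        return self.label_net(concepts), concepts

model = ConceptBottleneck(in_dim=32, n_concepts=5, n_classes=3)
x = torch.randn(4, 32)
logits, concepts = model(x)
# Intervene: force the first concept (e.g., "bone spur present") to 1 and re-predict.
fixed = concepts.detach().clone()
fixed[:, 0] = 1.0
logits_fixed, _ = model(x, concept_override=fixed)
```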

2023-05-23

Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

Accepted by: AAAI-2023

Presenter: Ziyang Jia

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Code: https://github.com/sunanhe/MKT

Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge by a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, while ignoring the rich semantic information inherent in image-text pairs. Instead, recently developed open-vocabulary (OV) based methods succeed in exploiting such information of image-text pairs in object detection, and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets.

2023-04-25

Distributionally Robust Post-hoc Classifiers under Prior Shifts

Accepted by: ICLR-2023

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributionally robust post-hoc classifiers.

2023-04-18

Revisiting the Calibration of Modern Neural Networks

Accepted by: NeurIPS-2021

Presenter: Qitong Wang

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.
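
For readers less familiar with calibration metrics, the standard expected calibration error (ECE) used in this line of work can be computed as follows; this sketch is generic, not the paper's exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence and average the |accuracy - confidence|
    gap per bin, weighted by the fraction of samples in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: a roughly well-calibrated predictor has a small ECE.
conf = np.random.uniform(0.5, 1.0, size=1000)
correct = (np.random.uniform(size=1000) < conf).astype(float)
print(expected_calibration_error(conf, correct))
```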

2023-04-12

Deep Model Reassembly

Accepted by: NeurIPS-2022

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Code: https://github.com/Adamdad/DeRy

In this paper, we explore a novel knowledge-transfer task, termed Deep Model Reassembly (DeRy), for general-purpose model reuse. Given a collection of heterogeneous models pre-trained from distinct sources and with diverse architectures, the goal of DeRy, as its name implies, is to first dissect each model into distinctive building blocks, and then selectively reassemble the derived blocks to produce customized networks under both hardware resource and performance constraints. The ambitious nature of DeRy inevitably imposes significant challenges, including, in the first place, the feasibility of its solution. We strive to show that, through a dedicated paradigm proposed in this paper, DeRy can be made not only possible but practically efficient. Specifically, we conduct the partitions of all pre-trained networks jointly via a cover set optimization, and derive a number of equivalence sets, within each of which the network blocks are treated as functionally equivalent and hence interchangeable. The equivalence sets learned in this way, in turn, enable picking and assembling blocks to customize networks subject to certain constraints, which is achieved via solving an integer program backed up with a training-free proxy to estimate the task performance. The reassembled models give rise to gratifying performance while satisfying the user-specified constraints. We demonstrate that on ImageNet, the best reassembled model achieves 78.6% top-1 accuracy without fine-tuning, which can be further elevated to 83.2% with end-to-end training.

2023-04-04

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Accepted by: ICLR-2023

Presenter: Tang Li

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Project Page: https://socraticmodels.github.io/

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

2023-03-21

Delaunay Component Analysis for Evaluation of Data Representations

Accepted by: ICLR-2022

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Advanced representation learning techniques require reliable and general evaluation methods. Recently, several algorithms based on the common idea of geometric and topological analysis of a manifold approximated from the learned data representations have been proposed. In this work, we introduce Delaunay Component Analysis (DCA) - an evaluation algorithm which approximates the data manifold using a more suitable neighbourhood graph called the Delaunay graph. This provides a reliable manifold estimation even for challenging geometric arrangements of representations such as clusters with varying shape and density as well as outliers, which is where existing methods often fail. Furthermore, we exploit the nature of Delaunay graphs and introduce a framework for assessing the quality of individual novel data representations. We experimentally validate the proposed DCA method on representations obtained from neural networks trained with a contrastive objective, as well as from supervised and generative models, and demonstrate various use cases of our extended single-point evaluation framework.

2023-03-07

ReAct: Synergizing Reasoning and Acting in Language Models

Accepted by: ICLR-2023

Presenter: Amani Arman Kiruga

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Project Page: https://react-lm.github.io/

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.
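
A hedged sketch of the interleaved thought/action/observation loop; the prompt format, the parser, and the scripted "LLM" and search tool below are hypothetical stand-ins, not the authors' prompts or API.

```python
def parse_action(text):
    """Extract 'Name[argument]' from an 'Action: Name[argument]' line."""
    action = text.split("Action:", 1)[1].strip()
    name, arg = action.split("[", 1)
    return name.strip(), arg.rstrip("]").strip()

def react_loop(question, llm, tools, max_steps=5):
    """ReAct-style loop sketch: the model alternates a free-text Thought and a
    structured Action; actions are run by external tools whose results are
    appended back to the transcript as an Observation."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                      # one Thought + Action block
        transcript += step + "\n"
        name, arg = parse_action(step)
        if name == "Finish":                        # e.g. Action: Finish[answer]
            return arg
        transcript += f"Observation: {tools[name](arg)}\n"
    return None

# Dummy stand-ins: a scripted "LLM" and a toy lookup tool.
script = iter(["Thought: I should look this up.\nAction: Search[Apollo 11]",
               "Thought: I have the answer.\nAction: Finish[1969]"])
answer = react_loop("When did Apollo 11 land?",
                    llm=lambda prompt: next(script),
                    tools={"Search": lambda q: "Apollo 11 landed in 1969."})
print(answer)  # -> 1969
```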

2023-03-01

Leveraging Domain Relations for Domain Generalization

Submitted to: arXiv-2023

Presenter: Fengchun Qiao

Time: 4:30-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Distribution shift is a major challenge in machine learning, as models often perform poorly during the test stage if the test distribution differs from the training distribution. In this paper, we focus on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on, and propose a new approach called D^3G. Unlike previous approaches that aim to learn a single model that is domain invariant, D^3G learns domain-specific models by leveraging the relations among different domains. Concretely, D^3G learns a set of training-domain-specific functions during the training stage and reweights them based on domain relations during the test stage. These domain relations can be directly derived or learned from fixed domain meta-data. Under mild assumptions, we theoretically prove that using domain relations to reweight training-domain-specific functions achieves stronger generalization than averaging them. Empirically, we evaluate the effectiveness of D^3G using both toy and real-world datasets for tasks such as temperature regression, land use classification, and molecule-protein interaction prediction. Our results show that D^3G consistently outperforms state-of-the-art methods, with an average improvement of 10.6% in performance.

2023-02-21

Agree to Disagree: Diversity through Disagreement for Better Transferability

Accepted by: ICLR-2023

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features (present in the training data but absent from the test data) and (ii) only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution, referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to assess a priori whether a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, as well as demonstrate in multiple experiments how the proposed method can mitigate shortcut learning, enhance uncertainty and OOD detection, and improve transferability.
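
To make the agreement/disagreement idea concrete, here is a simplified stand-in objective for two models; the particular disagreement penalty (softmax overlap on unlabeled OOD inputs) is our own choice, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dbat_objective(logits1, logits2, labels, logits1_ood, logits2_ood, lam=1.0):
    """D-BAT-style sketch: both models fit the labeled training data, while on
    unlabeled OOD inputs their predictive distributions are pushed apart by
    penalizing the overlap (inner product) of their softmax outputs."""
    fit = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    p1 = F.softmax(logits1_ood, dim=-1)
    p2 = F.softmax(logits2_ood, dim=-1)
    agreement = (p1 * p2).sum(dim=-1).mean()  # high when the two models agree
    return fit + lam * agreement

# Toy usage with random logits for 3 classes.
labels = torch.randint(0, 3, (16,))
loss = dbat_objective(torch.randn(16, 3), torch.randn(16, 3), labels,
                      torch.randn(8, 3), torch.randn(8, 3))
print(float(loss))
```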

2023-02-14

Diagnosing and Rectifying Vision Models using Language

Accepted by: ICLR-2023

Presenter: Tang Li

Time: 4:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.

2023-02-07

MBW: Multi-view Bootstrapping in the Wild

Accepted by: NeurIPS-2022

Presenter: Ziyang Jia

Time: 4:00-5:00 p.m. EST

Code: https://github.com/mosamdabhi/MBW

Zoom: https://udel.zoom.us/j/99236256016

Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled. The approach, however, is based on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios. In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras. With just a few annotations (representing 1-2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach demonstrates impressive results on standard human datasets, as well as tigers, cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos captured casually in a zoo. We release the codebase for MBW as well as this challenging zoo dataset, consisting of image frames of tail-end distribution categories with their corresponding 2D and 3D labels generated from minimal human intervention.

2023-01-17

Conformal Time-Series Forecasting

Accepted by: NeurIPS-2021

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99637309310

Slides: link

Current approaches for multi-horizon time series forecasting using recurrent neural networks (RNNs) focus on issuing point estimates, which is insufficient for decision-making in critical application domains where an uncertainty estimate is also required. Existing approaches for uncertainty quantification in RNN-based time-series forecasts are limited as they may require significant alterations to the underlying model architecture, may be computationally complex, may be difficult to calibrate, may incur high sample complexity, and may not provide theoretical guarantees on frequentist coverage. In this paper, we extend the inductive conformal prediction framework to the time-series forecasting setup, and propose a lightweight algorithm to address all of the above limitations, providing uncertainty estimates with theoretical guarantees for any multi-horizon forecast predictor and any dataset with minimal exchangeability assumptions. We demonstrate the effectiveness of our approach by comparing it with existing benchmarks on a variety of synthetic and real-world datasets.

2022-12-20

Survey of Diffusion Models

Presenter: Qitong Wang

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/94085069142

Slides: link

Related Paper 1: Accepted by NeurIPS, 2022.

Related Paper 2: Submitted to arXiv, 2022.

Related Paper 3: Accepted by CVPR, 2022.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
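
The fixed forward (noising) process mentioned above has a closed form, sketched below with a common linear noise schedule; the reverse, learned denoising model is omitted, and the variable names are our own.

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

betas = np.linspace(1e-4, 0.02, 1000)           # a common linear noise schedule
x0 = np.random.randn(8)                         # stand-in for a data sample
print(forward_diffuse(x0, t=999, betas=betas))  # nearly pure noise at the last step
```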

2022-12-13

Probable Domain Generalization via Quantile Risk Minimization

Accepted by: NeurIPS-2022

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/94085069142

Slides: link

Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the α-quantile of a predictor's risk distribution over domains, QRM seeks predictors that perform well with probability α. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a generalization bound for EQRM; and (ii) that EQRM recovers the causal predictor as α→1. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG, and demonstrate that EQRM outperforms state-of-the-art baselines on CMNIST and several datasets from WILDS and DomainBed.
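
A minimal sketch of the quantile objective over per-domain risks; this uses a plain empirical quantile rather than the paper's EQRM estimator, and the variable names are ours.

```python
import torch

def quantile_risk(per_domain_losses, alpha=0.9):
    """QRM-style objective sketch: instead of the mean or the max over domains,
    take the empirical alpha-quantile of the per-domain risks, asking the
    predictor to do well on at least an alpha fraction of domains."""
    losses = torch.stack(per_domain_losses)
    sorted_losses, _ = torch.sort(losses)
    idx = min(int(alpha * len(sorted_losses)), len(sorted_losses) - 1)
    return sorted_losses[idx]

# Toy usage: three domains with different empirical risks.
losses = [torch.tensor(0.2, requires_grad=True),
          torch.tensor(0.5, requires_grad=True),
          torch.tensor(1.3, requires_grad=True)]
quantile_risk(losses, alpha=0.9).backward()  # gradient flows to the quantile domain
```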

2022-11-29

Localizing Visual Sounds the Hard Way

Accepted by: CVPR-2021

Presenter: Amani Arman Kiruga

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/94085069142

Slides: link

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.

2022-11-15

A Brief Survey on Explainable AI

Presenter: Qiren Wang

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/96357119276

Machine learning has been developing for many years and has radically changed the world, especially over the past decade. Machine learning models now achieve high accuracy when asked to make predictions. However, the results of these high-accuracy models are hard to interpret, even though people want to know how a result was reached before making an important decision. This is why explainable AI has come into researchers' sight.

2022-11-01

On the Strong Correlation Between Model Invariance and Generalization

Accepted by: NeurIPS-2022

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96357119276

Slides: link

Generalization and invariance are two essential properties of machine learning models. Generalization captures a model’s ability to classify unseen data while invariance measures consistency of model predictions on transformations of the data. Existing research suggests a positive relationship: a model generalizing well should be invariant to certain visual factors. Building on this qualitative implication, we make two contributions. First, we introduce effective invariance (EI), a simple and reasonable measure of model invariance which does not rely on image labels. Given predictions on a test image and its transformed version, EI measures how well the predictions agree and with what level of confidence. Second, using invariance scores computed by EI, we perform large-scale quantitative correlation studies between generalization and invariance, focusing on rotation and grayscale transformations. From a model-centric view, we observe that generalization and invariance of different models exhibit a strong linear relationship, on both in-distribution and out-of-distribution datasets. From a dataset-centric view, we find that a given model’s accuracy and invariance are linearly correlated across different test sets. Apart from these major findings, other minor but interesting insights are also discussed.
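
Based on the description above, one plausible instantiation of an EI-style, label-free invariance score is sketched below; the exact formula may differ from the paper's.

```python
import numpy as np

def effective_invariance(probs_orig, probs_trans):
    """EI-style sketch (our reading of the abstract): a prediction pair scores
    high only when the original and transformed images receive the same predicted
    class, weighted by the confidence of both predictions. No labels are needed."""
    scores = []
    for p, q in zip(probs_orig, probs_trans):
        same = np.argmax(p) == np.argmax(q)
        scores.append(np.sqrt(p.max() * q.max()) if same else 0.0)
    return float(np.mean(scores))

p = np.array([[0.9, 0.1], [0.6, 0.4]])
q = np.array([[0.8, 0.2], [0.3, 0.7]])  # second prediction flips under the transform
print(effective_invariance(p, q))
```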

2022-10-18

Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation

Accepted by: CVPR-2020

Presenter: Tang Li

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96357119276

Slides: link

Image-level weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years. Most advanced solutions exploit class activation maps (CAMs). However, CAMs can hardly serve as the object mask due to the gap between full and weak supervisions. In this paper, we propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap. Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation, whose pixel-level labels take the same spatial transformation as the input images during data augmentation. However, this constraint is lost on the CAMs trained by image-level supervision. Therefore, we propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning. Moreover, we propose a pixel correlation module (PCM), which exploits context appearance information and refines the prediction of the current pixel using its similar neighbors, leading to further improvement in CAM consistency. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate our method outperforms state-of-the-art methods using the same level of supervision. The code is released online.
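
A toy sketch of the equivariance consistency term: the CAM of a transformed image should match the transformed CAM of the original image. The `cam_model` below is a hypothetical placeholder, and horizontal flip stands in for the augmentations used in practice.

```python
import torch

def equivariance_consistency(cam_model, images):
    """SEAM-style consistency sketch: penalize the gap between the CAM of a
    flipped image and the flipped CAM of the original image."""
    flipped = torch.flip(images, dims=[-1])  # horizontal flip as the transform
    cam_orig = cam_model(images)
    cam_flip = cam_model(flipped)
    return (cam_flip - torch.flip(cam_orig, dims=[-1])).abs().mean()

# Dummy stand-in "CAM network": a random conv layer as a placeholder.
conv = torch.nn.Conv2d(3, 21, kernel_size=3, padding=1)
loss = equivariance_consistency(lambda x: conv(x), torch.randn(2, 3, 64, 64))
print(float(loss))
```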

2022-10-11

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96357119276

2022-10-04

Conditional Prompt Learning for Vision-Language Models

Accepted by: CVPR-2022

Presenter: Qitong Wang

Time: 3:30-5:00 p.m. EDT

Code: https://github.com/KaiyangZhou/CoOp

Zoom: https://udel.zoom.us/j/96357119276

Slides: link

Related Paper 1: "CLIP", ICML, 2021.

Related Paper 2: "CoOp", IJCV, 2022.

With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well.
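
A hedged sketch of the instance-conditional prompt idea with toy dimensions; module names and sizes are our assumptions, not the released CoCoOp code.

```python
import torch
import torch.nn as nn

class ConditionalContext(nn.Module):
    """CoCoOp-style sketch: learnable context vectors shared across classes, plus
    a small meta-net that maps each image feature to a token added to every
    context vector, making the prompt input-conditional."""
    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable prompt context
        self.meta_net = nn.Sequential(nn.Linear(dim, dim // 16), nn.ReLU(),
                                      nn.Linear(dim // 16, dim))

    def forward(self, image_features):          # image_features: (batch, dim)
        bias = self.meta_net(image_features)     # one conditioning token per image
        # (batch, n_ctx, dim): shared context shifted by the image-conditional bias
        return self.ctx.unsqueeze(0) + bias.unsqueeze(1)

prompt_ctx = ConditionalContext()
ctx = prompt_ctx(torch.randn(8, 512))  # would be concatenated with class-name tokens
print(ctx.shape)                       # torch.Size([8, 4, 512])
```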

2022-09-27

VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection

Accepted by: NeurIPS-2021

Presenter: Amani Arman Kiruga

Time: 3:30-5:00 p.m. EDT

Code: https://github.com/DASH-Lab/VFP290K

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

Detection of fallen persons due to, for example, health problems, violence, or accidents, is a critical challenge. Accordingly, detection of these anomalous events is of paramount importance for a number of applications, including but not limited to CCTV surveillance, security, and health care. Given that many detection systems rely on comprehensive training data, a dataset comprising fallen person images collected in diverse environments and situations is crucial. However, existing datasets are limited to specific environmental conditions and lack diversity. To address the above challenges and help researchers develop more robust detection systems, we create a novel, large-scale dataset for the detection of fallen persons composed of fallen person images collected in various real-world scenarios, with the support of the South Korean government. Our Vision-based Fallen Person (VFP290K) dataset consists of 294,713 frames of fallen persons extracted from 178 videos, including 131 scenes in 49 locations. We empirically demonstrate the effectiveness of the features through extensive experiments analyzing the performance shift based on object detection models. In addition, we evaluate properly divided versions of our VFP290K dataset by measuring the performance of fallen person detection systems. We ranked first in the first round of the anomalous behavior recognition track of AI Grand Challenge 2020, South Korea, using our VFP290K dataset, which can be found here. Our achievement implies the usefulness of our dataset for research on fallen person detection, which can further extend to other applications, such as intelligent CCTV or monitoring systems. The data and more up-to-date information are provided at our VFP290K site.

2022-09-20

General Multi-label Image Classification with Transformers

Accepted by: CVPR-2021

Presenter: Ziyang Jia

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels, and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the uncertainty of labels during training, it is more general by allowing us to produce improved results for images with partial or extra label annotations during inference. We demonstrate this additional capability in the COCO, Visual Genome, News500, and CUB image datasets.

2022-09-13

Generalizing to Evolving Domains with Latent Structure-Aware Sequential Autoencoder

Accepted by: ICML-2022

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Domain generalization aims to improve the generalization capability of machine learning systems to out-of-distribution (OOD) data. Existing domain generalization techniques assume stationary and discrete environments to tackle the generalization issue caused by OOD data. However, many real-world tasks in non-stationary environments (e.g., self-driving car systems, sensor measurements) involve more complex and continuously evolving domain drift, which raises new challenges for domain generalization. In this paper, we formulate the aforementioned setting as the problem of evolving domain generalization. Specifically, we propose a probabilistic framework called Latent Structure-aware Sequential Autoencoder (LSSAE) to tackle evolving domain generalization by exploring the underlying continuous structure in the latent space of deep neural networks, where we aim to identify two major factors, namely covariate shift and concept shift, that account for distribution shift in non-stationary environments. Experimental results on both synthetic and real-world datasets show that LSSAE achieves superior performance in the evolving domain generalization setting.

2022-09-06

Interpretations are useful: penalizing explanations to align neural networks with prior knowledge

Accepted by: ICML-2020

Presenter: Tang Li

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods in order to increase the predictive accuracy of deep learning models. In particular, when shown that a model has incorrectly assigned importance to some features, CDEP enables practitioners to correct these errors by directly regularizing the provided explanations. Using explanations provided by contextual decomposition (CD) (Murdoch et al., 2018), we demonstrate the ability of our method to increase performance on an array of toy and real datasets.
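
A hedged sketch of explanation penalization: the total loss adds a term that punishes importance assigned to regions a prior marks as irrelevant. CDEP itself uses contextual decomposition; here a plain input-gradient saliency stands in for the attribution method, and irrelevant_mask is a hypothetical per-pixel prior.

    import torch
    import torch.nn.functional as F

    def explanation_penalized_loss(model, x, y, irrelevant_mask, lam=1.0):
        # Sketch: classification loss + penalty on attributions over irrelevant regions.
        x = x.clone().requires_grad_(True)
        logits = model(x)
        task_loss = F.cross_entropy(logits, y)
        score = logits.gather(1, y.unsqueeze(1)).sum()
        # input-gradient saliency as a simple stand-in for contextual decomposition
        saliency = torch.autograd.grad(score, x, create_graph=True)[0].abs()
        expl_loss = (saliency * irrelevant_mask).mean()
        return task_loss + lam * expl_loss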

2022-08-30

Self-Supervised Learning Disentangled Group Representation as Feature

Accepted by: NeurIPS-2021

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EDT

Code: https://github.com/Wangt-CN/IP-IRM

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of "good" representation from a group-theoretic view using Higgins' definition of disentangled representation, and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, thus unable to modularize the remaining semantics. To break the limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them into concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees to disentangle the group element. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks.

2022-08-23

Federated Multi-Task Learning under a Mixture of Distributions

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.

2022-08-16

Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos

Accepted by: CVPR-2022

Presenter: Qitong Wang

Time: 12:30-2:00 p.m. EDT

Project Page: https://stevenlsw.github.io/hoi-forecast/

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

We propose to forecast future hand-object interactions given an egocentric video. Instead of predicting action labels or pixels, we directly predict the hand motion trajectory and the future contact points on the next active object (i.e., interaction hotspots). This relatively low-dimensional representation provides a concrete description of future interactions. To tackle this task, we first provide an automatic way to collect trajectory and hotspots labels on large-scale data. We then use this data to train an Object-Centric Transformer (OCT) model for prediction. Our model performs hand and object interaction reasoning via the self-attention mechanism in Transformers. OCT also provides a probabilistic framework to sample the future trajectory and hotspots to handle uncertainty in prediction. We perform experiments on the Epic-Kitchens-55, Epic-Kitchens-100, and EGTEA Gaze+ datasets, and show that OCT significantly outperforms state-of-the-art approaches by a large margin.

2022-08-10

Detecting Moments and Highlights in Videos via Natural Language Queries

Accepted by: NeurIPS-2021

Presenter: Amani Arman Kiruga

Time: 1:00-2:00 p.m. EDT

Data & Code: https://github.com/jayleicn/moment_detr

Zoom: https://udel.zoom.us/j/96016065684

Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR.

2022-08-03

A Brief Review on Transformer-Based Video Captioning Tasks

Presenter: Ziyang Jia

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

As a connection between the two worlds of vision (CV) and language (NLP), video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. The task is naturally decomposed into two sub-tasks. One is to encode a video via a thorough understanding and learn visual representation. The other is caption generation, which decodes the learned representation into a sequential sentence, word by word.

2022-07-06 & 27

Contrastive Test-Time Adaptation

Accepted by: CVPR-2022

Presenter: Kien Nguyen

Time: 1:00-2:00 p.m. EDT

Project Page: https://sites.google.com/view/adacontrast

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

Test-time adaptation is a special setting of unsupervised domain adaptation where a trained model on the source domain has to adapt to the target domain without accessing source data. We propose a novel way to leverage self-supervised contrastive learning to facilitate target feature learning, along with an online pseudo labeling scheme with refinement that significantly denoises pseudo labels. The contrastive learning task is applied jointly with pseudo labeling, contrasting positive and negative pairs constructed similarly as MoCo but with source-initialized encoder, and excluding same-class negative pairs indicated by pseudo labels. Meanwhile, we produce pseudo labels online and refine them via soft voting among their nearest neighbors in the target feature space, enabled by maintaining a memory queue. Our method, AdaContrast, achieves state-of-the-art performance on major benchmarks while having several desirable properties compared to existing works, including memory efficiency, insensitivity to hyper-parameters, and better model calibration.
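
A hedged sketch of the nearest-neighbor soft-voting step: pseudo labels for target features are refined by averaging the probabilities of their nearest neighbors in a memory queue. Tensor shapes and the cosine-similarity retrieval are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def refine_pseudo_labels(feats, memory_feats, memory_probs, k=5):
        # feats: (B, D) target features; memory_*: queue of past features/probabilities.
        feats = F.normalize(feats, dim=1)
        memory_feats = F.normalize(memory_feats, dim=1)
        sim = feats @ memory_feats.t()             # (B, M) cosine similarities
        _, idx = sim.topk(k, dim=1)                # k nearest neighbors in the queue
        refined = memory_probs[idx].mean(dim=1)    # soft vote over neighbors' probabilities
        return refined.argmax(dim=1), refined      # refined pseudo labels and soft scores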

2022-06-29

WILDS: A Benchmark of in-the-Wild Distribution Shifts

Accepted by: ICML-2021

Presenter: Tang Li

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations.

2022-06-15

Graph-Based Continual Learning

Accepted by: ICLR-2021

Presenter: Fengchun Qiao

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/92841583823

Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.

2022-06-08

Exponential Graph is Provably Efficient for Decentralized Deep Training

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

Decentralized SGD is an emerging training method for deep learning, known for requiring much less (and thus faster) communication per iteration, which it achieves by relaxing the averaging step in parallel SGD to inexact averaging. The less exact the averaging, however, the more total iterations the training needs. Therefore, the key to making decentralized SGD efficient is to realize nearly exact averaging using little communication. This requires a skillful choice of communication topology, which is an under-studied topic in decentralized optimization. In this paper, we study so-called exponential graphs, where every node is connected to O(log(n)) neighbors and n is the total number of nodes. This work proves that such graphs can lead to both fast communication and effective averaging simultaneously. We also discover that a sequence of log(n) one-peer exponential graphs, in which each node communicates with a single neighbor per iteration, can together achieve exact averaging. This favorable property enables the one-peer exponential graph to average as effectively as its static counterpart while communicating more efficiently. We apply these exponential graphs in decentralized (momentum) SGD to obtain the state-of-the-art balance between per-iteration communication and iteration complexity among all commonly used topologies. Experimental results on a variety of tasks and models demonstrate that decentralized (momentum) SGD over exponential graphs promises both fast and high-quality training.
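
A small sketch of the one-peer schedule described above, under the common construction where at iteration t each node i sends to (i + 2^(t mod tau)) mod n with tau = ceil(log2 n); the helper name is hypothetical and details may differ from the paper.

    import math

    def one_peer_exponential_schedule(n, iterations):
        # At each iteration every node talks to exactly one neighbor; the hop
        # distance cycles through powers of two (1, 2, 4, ...).
        tau = max(1, math.ceil(math.log2(n)))
        schedule = []
        for t in range(iterations):
            hop = 2 ** (t % tau)
            schedule.append({i: (i + hop) % n for i in range(n)})
        return schedule

    # Example with 8 nodes and 3 rounds: hop distances 1, 2, 4.
    print(one_peer_exponential_schedule(8, 3))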

2022-06-01

Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Accepted by: ECCV-2022

Presenter: Qitong Wang

Time: 1:00-2:00 p.m. EDT

Project Page: https://yuliang.vision/InstCal/

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

We propose a test-time adaptation method for cross-domain image segmentation. Our method is simple: Given a new unseen instance at test time, we adapt a pre-trained model by conducting instance-specific BatchNorm (statistics) calibration. Our approach has two core components. First, we replace the manually designed BatchNorm calibration rule with a learnable module. Second, we leverage strong data augmentation to simulate random domain shifts for learning the calibration rule. In contrast to existing domain adaptation methods, our method does not require accessing the target domain data at training time or conducting computationally expensive test-time model training/optimization. Equipping our method with models trained by standard recipes achieves significant improvement, comparing favorably with several state-of-the-art domain generalization and one-shot unsupervised domain adaptation approaches. Combining our method with the domain generalization methods further improves performance, reaching a new state of the art.
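
A hedged sketch of instance-specific BatchNorm calibration: at test time, the stored source statistics of a BatchNorm layer are blended with the current instance's own statistics through a learnable coefficient. This simplifies the paper's learnable calibration module to a single scalar per layer.

    import torch
    import torch.nn as nn

    class InstanceCalibratedBN2d(nn.Module):
        # Sketch: wrap a trained BatchNorm2d and mix its running statistics with
        # the statistics of the incoming (test) instance.
        def __init__(self, bn: nn.BatchNorm2d):
            super().__init__()
            self.bn = bn
            self.alpha = nn.Parameter(torch.tensor(2.0))  # learned mixing logit

        def forward(self, x):
            a = torch.sigmoid(self.alpha)                 # weight on source statistics
            inst_mean = x.mean(dim=(0, 2, 3))
            inst_var = x.var(dim=(0, 2, 3), unbiased=False)
            mean = a * self.bn.running_mean + (1 - a) * inst_mean
            var = a * self.bn.running_var + (1 - a) * inst_var
            x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.bn.eps)
            return x_hat * self.bn.weight[None, :, None, None] + self.bn.bias[None, :, None, None]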

2022-05-25

Do Feature Attribution Methods Correctly Attribute Features?

Accepted by: AAAI-2022

Presenter: Ziyang Jia

Time: 1:30-3:00 p.m. EDT

Zoom: https://udel.zoom.us/j/94384898933

Feature attribution methods are popular in interpretable machine learning. These methods compute the attribution of each input feature to represent its importance, but there is no consensus on the definition of "attribution", leading to many competing methods with little systematic evaluation, complicated in particular by the lack of ground truth attribution. To address this, we propose a dataset modification procedure to induce such ground truth. Using this procedure, we evaluate three common methods: saliency maps, rationales, and attentions. We identify several deficiencies and add new perspectives to the growing body of evidence questioning the correctness and reliability of these methods applied on datasets in the wild. We further discuss possible avenues for remedy and recommend new attribution methods to be tested against ground truth before deployment.

2022-04-26

Explainable Deep Classification Models for Domain Generalization

Accepted by: CVPR-W-2021

Presenter: Tang Li

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Conventionally, AI models are thought to trade off explainability for lower accuracy. We develop a training strategy that not only leads to a more explainable AI system for object classification, but as a consequence, suffers no perceptible accuracy degradation. Explanations are defined as regions of visual evidence upon which a deep classification network makes a decision. This is represented in the form of a saliency map conveying how much each pixel contributed to the network's decision. Our training strategy enforces a periodic saliency-based feedback to encourage the model to focus on the image regions that directly correspond to the ground-truth object. We quantify explainability using an automated metric, and using human judgement. We propose explainability as a means for bridging the visual-semantic gap between different domains where model explanations are used as a means of disentangling domain specific information from otherwise relevant features. We demonstrate that this leads to improved generalization to new domains without hindering performance on the original domain.

2022-04-19

Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition

Accepted by: ICCV-2021

Presenter: Kien Nguyen

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Human pose is a useful feature for fine-grained sports action understanding. However, pose estimators are often unreliable when run on sports video due to domain shift and factors such as motion blur and occlusions. This leads to poor accuracy when downstream tasks, such as action recognition, depend on pose. End-to-end learning circumvents pose, but requires more labels to generalize. We introduce Video Pose Distillation (VPD), a weakly-supervised technique to learn features for new video domains, such as individual sports that challenge pose estimation. Under VPD, a student network learns to extract robust pose features from RGB frames in the sports video, such that, whenever pose is considered reliable, the features match the output of a pretrained teacher pose detector. Our strategy retains the best of both pose and end-to-end worlds, exploiting the rich visual patterns in raw video frames, while learning features that agree with the athletes' pose and motion in the target video domain to avoid over-fitting to patterns unrelated to athletes' motion. VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in four real-world sports video datasets, without requiring additional ground-truth pose annotations.

2022-04-12

Environment Inference for Invariant Learning

Accepted by: ICML-2021

Presenter: Fengchun Qiao

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Learning models that gracefully handle distribution shifts is central to research on domain generalization, robust optimization, and fairness. A promising formulation is domain-invariant learning, which identifies the key issue of learning which features are domain-specific versus domain-invariant. An important assumption in this area is that the training examples are partitioned into "domains" or "environments". Our focus is on the more common setting where such partitions are not provided. We propose EIIL, a general framework for domain-invariant learning that incorporates Environment Inference to directly infer partitions that are maximally informative for downstream Invariant Learning. We show that EIIL outperforms invariant learning methods on the CMNIST benchmark without using environment labels, and significantly outperforms ERM on worst-group performance in the Waterbirds and CivilComments datasets. Finally, we establish connections between EIIL and algorithmic fairness, which enables EIIL to improve accuracy and calibration in a fair prediction problem.
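
A hedged sketch of the idea: the IRMv1 penalty measures how far a reference classifier is from being simultaneously optimal across environments, and environment inference learns soft per-sample assignments that maximize that penalty. The optimization loop, binary split, and cross-entropy risk are simplifying assumptions, not the paper's exact procedure.

    import torch
    import torch.nn.functional as F

    def irm_penalty(logits, y):
        # IRMv1 penalty: squared gradient of the risk w.r.t. a dummy classifier scale.
        scale = torch.ones(1, device=logits.device, requires_grad=True)
        loss = F.cross_entropy(logits * scale, y)
        grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
        return (grad ** 2).sum()

    def infer_environments(logits, y, steps=500, lr=0.01):
        # Learn soft assignments to two environments that maximize the summed
        # per-environment IRM penalty of a fixed reference model (sketch).
        logits = logits.detach()
        w = torch.zeros(len(y), device=logits.device, requires_grad=True)
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            q = torch.sigmoid(w)
            scale = torch.ones(1, device=logits.device, requires_grad=True)
            per_sample = F.cross_entropy(logits * scale, y, reduction="none")
            penalty = 0.0
            for weight in (q, 1 - q):
                risk = (weight * per_sample).mean()
                g = torch.autograd.grad(risk, [scale], create_graph=True)[0]
                penalty = penalty + (g ** 2).sum()
            opt.zero_grad()
            (-penalty).backward()      # gradient ascent on the penalty
            opt.step()
        return torch.sigmoid(w).detach()   # soft environment assignment per sample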

2022-04-05

Learning with Noisy Correspondence for Cross-modal Matching

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Cross-modal matching, which aims to establish the correspondence between two different modalities, is fundamental to a variety of tasks such as cross-modal retrieval and vision-and-language understanding. Although a huge number of cross-modal matching methods have been proposed and have achieved remarkable progress in recent years, almost all of these methods implicitly assume that the multimodal training data are correctly aligned. In practice, however, such an assumption is extremely expensive or even impossible to satisfy. Based on this observation, we reveal and study a latent and challenging direction in cross-modal matching, named noisy correspondence, which can be regarded as a new paradigm of noisy labels. Different from traditional noisy labels, which mainly refer to errors in category labels, noisy correspondence refers to mismatched paired samples. To solve this new problem, we propose a novel method for learning with noisy correspondence, named Noisy Correspondence Rectifier (NCR). In brief, NCR divides the data into clean and noisy partitions based on the memorization effect of neural networks and then rectifies the correspondence via an adaptive prediction model in a co-teaching manner. To verify the effectiveness of our method, we conduct experiments using image-text matching as a showcase. Extensive experiments on Flickr30K, MS-COCO, and Conceptual Captions verify the effectiveness of our method. The code can be accessed at www.pengxi.me.
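
A hedged sketch of the memorization-based split used by co-teaching-style methods: fit a two-component Gaussian mixture to per-sample losses and treat samples likely to come from the low-loss component as cleanly paired. The normalization and threshold here are illustrative choices.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def split_clean_noisy(per_sample_losses, threshold=0.5):
        # Low-loss component ~ clean pairs; high-loss component ~ noisy correspondence.
        losses = np.asarray(per_sample_losses, dtype=np.float64).reshape(-1, 1)
        losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
        gmm = GaussianMixture(n_components=2, reg_covar=1e-4).fit(losses)
        clean_component = int(np.argmin(gmm.means_.ravel()))
        p_clean = gmm.predict_proba(losses)[:, clean_component]
        return p_clean > threshold, p_clean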

2022-03-22

Neighborhood Contrastive Learning for Novel Class Discovery

Accepted by: CVPR-2021

Presenter: Qitong Wang

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).

2022-03-15

Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach

Accepted by: CVPR-2020

Presenter: Ziyang Jia

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

We propose to estimate 3D human pose from multi-view images and a few IMUs attached to a person's limbs. The approach operates by first detecting 2D poses from the two signals and then lifting them to 3D space. We present a geometric approach to reinforce the visual features of each pair of joints based on the IMUs. This notably improves 2D pose estimation accuracy, especially when one joint is occluded. We call this approach the Orientation Regularized Network (ORN). We then lift the multi-view 2D poses to 3D space with an Orientation Regularized Pictorial Structure Model (ORPSM), which jointly minimizes the projection error between the 3D and 2D poses, along with the discrepancy between the 3D pose and IMU orientations. This simple two-step approach reduces the error of the state of the art by a large margin on a public dataset.

2022-03-01

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Accepted by: CVPR-2020

Presenter: Amani Arman Kiruga

Time: 12:30-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
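
A hedged sketch of a MIL-NCE-style objective: each video has several candidate narrations, and the loss pools all of them in the numerator of an NCE ratio, so the model does not have to decide which single caption is aligned. Shapes and the binary positive mask are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def mil_nce_loss(video_emb, text_emb, pos_mask, temperature=0.07):
        # video_emb: (B, D), text_emb: (M, D), pos_mask: (B, M) with 1s marking the
        # candidate positive narrations for each video.
        video_emb = F.normalize(video_emb, dim=1)
        text_emb = F.normalize(text_emb, dim=1)
        sim = video_emb @ text_emb.t() / temperature   # (B, M) similarities
        exp_sim = sim.exp()
        pos = (exp_sim * pos_mask).sum(dim=1)          # pooled candidate positives
        denom = exp_sim.sum(dim=1)                     # positives + negatives
        return -(pos / denom).log().mean()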

2022-02-22

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Accepted by: ECCV-2020

Presenter: Nathaniel Merrill

Time: 12:30-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
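
A short sketch of the classic volume rendering step the abstract refers to: given densities and colors sampled along a ray, compute per-sample opacities, accumulated transmittance, and the composited pixel color. This is the standard formulation; the network and ray sampling details are omitted.

    import torch

    def volume_render(sigmas, colors, deltas):
        # sigmas: (R, S) densities, colors: (R, S, 3), deltas: (R, S) sample spacings.
        alphas = 1.0 - torch.exp(-sigmas * deltas)                 # per-sample opacity
        trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)        # accumulated transmittance
        trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
        weights = alphas * trans                                   # contribution of each sample
        rgb = (weights.unsqueeze(-1) * colors).sum(dim=-2)         # composited ray color
        return rgb, weights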

2022-02-15

FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction

Submitted to: arXiv-2021

Presenter: Kien Nguyen

Time: 1:00-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative positions between the cameras. Such a dependency becomes a hurdle once shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end parameter-free multi-view model. FLEX is parameter-free in the sense that it does not require any camera parameters, neither intrinsic nor extrinsic. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on the Human3.6M and KTH Multi-view Football II datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not parameter-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, video examples, and more material will be available on our project page.

2022-02-08

A method to evaluate task-specific importance of spatio-temporal units based on explainable artificial intelligence

Accepted by: International Journal of Geographical Information Science, 2020.

Presenter: Tang Li

Time: 12:30-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Big geo-data are often aggregated according to spatio-temporal units for analyzing human activities and urban environments. Many applications categorize such data into groups and compare the characteristics across groups. The intergroup differences vary with spatio-temporal units, and the essential task is to identify the spatio-temporal units with apparently different data characteristics. However, spatio-temporal dependence, data variety, and the complexity of tasks impede an effective unit assessment. Inspired by applications that extract critical image components based on explainable artificial intelligence (XAI), we propose a spatio-temporal layer-wise relevance propagation method to assess spatio-temporal units as a general solution. The method organizes input data into an extensible three-dimensional tensor form. We provide two means of labeling the spatio-temporal tensor data for typical geographical applications, using temporally or spatially relevant information. Neural network training proceeds to extract the global and local characteristics of the data for corresponding analytical tasks. Then the method propagates classification results backward into the units as their task-specific importance. A case study with taxi trajectory data in Beijing validates the method. The results prove that the proposed method can evaluate the task-specific importance of spatio-temporal units with dependence. This study also attempts to discover task-related knowledge using XAI.

2022-02-03

Continual Adaptation of Visual Representations via Domain Randomization and Meta-learning

Accepted by: CVPR-2021

Presenter: Fengchun Qiao

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Most standard learning approaches lead to fragile models which are prone to drift when sequentially trained on samples of a different nature - the well-known "catastrophic forgetting" issue. In particular, when a model consecutively learns from different visual domains, it tends to forget the past domains in favor of the most recent ones. In this context, we show that one way to learn models that are inherently more robust against forgetting is domain randomization - for vision tasks, randomizing the current domain's distribution with heavy image manipulations. Building on this result, we devise a meta-learning strategy where a regularizer explicitly penalizes any loss associated with transferring the model from the current domain to different "auxiliary" meta-domains, while also easing adaptation to them. Such meta-domains are also generated through randomized image manipulations. We empirically demonstrate in a variety of experiments - spanning from classification to semantic segmentation - that our approach results in models that are less prone to catastrophic forgetting when transferred to new domains.

2022-01-27

Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose Estimation

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Available 3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision. Barring synthetic or in-studio domains, acquiring such supervision for each new target environment is highly inconvenient. To this end, we cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target. We propose to infer image-to-pose via two explicit mappings viz. image-to-latent and latent-to-pose where the latter is a pre-learned decoder obtained from a prior-enforcing generative adversarial auto-encoder. Next, we introduce relation distillation as a means to align the unpaired cross-modal samples i.e., the unpaired target videos and unpaired 3D pose sequences. To this end, we propose a new set of non-local relations in order to characterize long-range latent pose interactions, unlike general contrastive relations where positive couplings are limited to a local neighborhood structure. Further, we provide an objective way to quantify non-localness in order to select the most effective relation set. We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.

2022-01-20

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

Accepted by: CVPR-2021

Presenter: Qitong Wang

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.

2021-12-08

A Recent Trend on Contrastive Learning

Presenter: Zhenzhu Zheng

Time: 8:00-9:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Contrastive learning has recently received interest due to its rapid success in self-supervised representation learning. This presentation first reviews recent work in the area of contrastive learning. Although contrastive learning methods show significant progress on large-model training, they do not work well for small models. This brings us to another area known as knowledge distillation. Although most previous studies on knowledge distillation are in supervised settings, this presentation further discusses the recent trend of merging knowledge distillation with contrastive learning. This motivates a key question for the future: how can we learn so much from observation alone? Finally, we discuss possible open problems.

2021-12-01

DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data

Submitted to: arXiv-2020

Presenter: Nathaniel Merrill

Time: 8:00-9:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation, thus preserving accurate shapes of a scene. Previous methods that predict metric depth often work well only for a specific scene. In contrast, learning relative depth (information of being closer or further) can enjoy better generalization, at the price of failing to recover the accurate geometric shape of the scene. In this work, we propose a dataset and methods to tackle this dilemma, aiming to predict accurate depth up to an affine transformation with good generalization to diverse scenes. First, we construct a large-scale and diverse dataset, termed Diverse Scene Depth dataset (DiverseDepth), which has a broad range of scenes and foreground contents. Compared with previous learning objectives, i.e., learning metric depth or relative depth, we propose to learn the affine-invariant depth using our diverse dataset to ensure both generalization and high-quality geometric shapes of scenes. Furthermore, in order to train the model on the complex dataset effectively, we propose a multi-curriculum learning method. Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting, demonstrating the excellent generalization capacity of the learned model to diverse scenes. The reconstructed point clouds with the predicted depth show that our method can recover high-quality 3D shapes. Code and dataset are available at: https://tinyurl.com/DiverseDepth
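
A hedged sketch of an affine-invariant depth loss in the spirit of the paper (not necessarily its exact formulation): each depth map is aligned by subtracting its median and dividing by its mean absolute deviation before an L1 comparison, so predictions are compared up to scale and shift.

    import torch

    def affine_invariant_depth_loss(pred, target, eps=1e-6):
        # pred, target: (B, H, W) depth maps; compare after per-image alignment.
        def align(d):
            d = d.flatten(1)
            med = d.median(dim=1, keepdim=True).values
            mad = (d - med).abs().mean(dim=1, keepdim=True)
            return (d - med) / (mad + eps)
        return (align(pred) - align(target)).abs().mean()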

2021-11-17

Survey on Out-of-Domain Detection in Deep Learning

Presenter: Wenxuan Li

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

In this survey, we focus on out-of-domain detection, in other words covariate shift detection. We start from the broader topic of out-of-distribution detection through a unified framework termed generalized out-of-distribution detection, since out-of-domain detection is a subtask of out-of-distribution detection. Under this framework, five problems (Anomaly Detection, Novelty Detection, Open Set Recognition, Out-of-Distribution Detection, and Outlier Detection) can be viewed as special cases or subtopics. We then clarify the background, definition, and applications of each subtopic. These five subtopics can be categorized by whether covariate shift, semantic shift, or both occur; since we are currently interested only in covariate shift detection, we select the subtopics involving covariate shift, summarize recent technical developments, and categorize the existing methods of each. At the end, a comprehensive paper list covering all five subtopics is given.

2021-11-10

Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos

Accepted by: CVPR-2021

Presenter: Shivanand Venkanna Sheshappanavar

Time: 8:00-9:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Point cloud videos exhibit irregularities and lack of order along the spatial dimension where points emerge inconsistently across different frames. To capture the dynamics in point cloud videos, point tracking is usually employed. However, as points may flow in and out across frames, computing accurate point trajectories is extremely difficult. Moreover, tracking usually relies on point colors and thus may fail to handle colorless point clouds. In this paper, to avoid point tracking, we propose a novel Point 4D Transformer (P4Transformer) network to model raw point cloud videos. Specifically, P4Transformer consists of (i) a point 4D convolution to embed the spatio-temporal local structures presented in a point cloud video and (ii) a transformer to capture the appearance and motion information across the entire video by performing self-attention on the embedded local features. In this fashion, related or similar local areas are merged with attention weight rather than by explicit tracking. Extensive experiments, including 3D action recognition and 4D semantic segmentation, on four benchmarks demonstrate the effectiveness of our P4Transformer for point cloud video modeling.

2021-10-27

ViViT: A Video Vision Transformer

Accepted by: ICCV-2021

Presenter: Kien Nguyen

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.
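
A small sketch of tubelet embedding, one plausible way the abstract's spatio-temporal tokens can be extracted: a 3D convolution with matching kernel and stride turns a clip into a sequence of tokens for the transformer encoder. Dimensions are illustrative.

    import torch
    import torch.nn as nn

    class TubeletEmbedding(nn.Module):
        # Non-overlapping 3D patches ("tubelets") projected to token embeddings.
        def __init__(self, in_channels=3, dim=768, tubelet=(2, 16, 16)):
            super().__init__()
            self.proj = nn.Conv3d(in_channels, dim, kernel_size=tubelet, stride=tubelet)

        def forward(self, video):                       # video: (B, C, T, H, W)
            tokens = self.proj(video)                   # (B, dim, T', H', W')
            return tokens.flatten(2).transpose(1, 2)    # (B, T'*H'*W', dim)

    # Example: a 32-frame 224x224 clip becomes 16*14*14 = 3136 tokens.
    clip = torch.randn(1, 3, 32, 224, 224)
    print(TubeletEmbedding()(clip).shape)   # torch.Size([1, 3136, 768])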

2021-10-20

Volumetric Breast Density Estimation on MRI Using Explainable Deep Learning Regression

Accepted by: Nature: Scientific Reports-2021

Presenter: Tang Li

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

The purpose of this paper was to assess the feasibility of volumetric breast density estimation on MRI without segmentations, accompanied by an explainability step. A total of 615 patients with breast cancer were included for volumetric breast density estimation. A 3-dimensional regression convolutional neural network (CNN) was used to estimate the volumetric breast density. Patients were split into training (N = 400), validation (N = 50), and hold-out test sets (N = 165). Hyperparameters were optimized using Neural Network Intelligence, and augmentations consisted of translations and rotations. The estimated densities were evaluated against the ground truth using Spearman’s correlation and Bland–Altman plots. The output of the CNN was visually analyzed using SHapley Additive exPlanations (SHAP). Spearman’s correlation between estimated and ground truth density was ρ = 0.81 (N = 165, P < 0.001) in the hold-out test set. The estimated density had a median bias of 0.70% (95% limits of agreement = − 6.8% to 5.0%) relative to the ground truth. SHAP showed that in correct density estimations, the algorithm based its decision on fibroglandular and fatty tissue. In incorrect estimations, other structures such as the pectoral muscle or the heart were included. To conclude, it is feasible to automatically estimate volumetric breast density on MRI without segmentations, and to provide accompanying explanations.

2021-10-13

Efficient Continual Learning with Modular Networks and Task-Driven Prior

Submitted to: ICLR-2021

Presenter: Fengchun Qiao

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past. There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. Finally, we introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use, and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. The Benchmark is publicly available at https://github.com/facebookresearch/CTrLBenchmark.

2021-10-06

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Submitted to: ArXiv-2021

Presenter: Meng Ma

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors which can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

2021-09-29

Time-series Generative Adversarial Networks

Accepted to: NeurIPS-2019

Presenter: Pranjal Dhakal

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

A good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between variables across time. Existing methods that bring generative adversarial networks (GANs) into the sequential setting do not adequately attend to the temporal correlations unique to time-series data. At the same time, supervised models for sequence prediction - which allow finer control over network dynamics - are inherently deterministic. We propose a novel framework for generating realistic time-series data that combines the flexibility of the unsupervised paradigm with the control afforded by supervised training. Through a learned embedding space jointly optimized with both supervised and adversarial objectives, we encourage the network to adhere to the dynamics of the training data during sampling. Empirically, we evaluate the ability of our method to generate realistic samples using a variety of real and synthetic time-series datasets. Qualitatively and quantitatively, we find that the proposed framework consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability.

2021-09-22

Broaden Your Views for Self-Supervised Video Learning

Accepted to: ICCV-2021

Presenter: Amani Arman Kiruga

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has a broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities into the broad view such as optical flow, randomly convolved RGB frames, audio or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.

2021-09-15

CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

Accepted to: ICML-2020

Presenter: Qitong Wang

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Mutual information (MI) minimization has gained considerable interest in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. We provide a theoretical analysis of the properties of CLUB and its variational approximation. Based on this upper bound, we introduce an MI minimization training scheme and further accelerate it with a negative sampling strategy. Simulation studies on Gaussian distributions show the reliable estimation ability of CLUB. Real-world MI minimization experiments, including domain adaptation and information bottleneck, demonstrate the effectiveness of the proposed method. The code is at https://github.com/Linear95/CLUB.
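
A hedged sketch of the sampled estimator implied by the abstract: with a variational network q(y|x), the upper bound is the gap between the conditional log-likelihood on paired samples and on randomly re-paired samples. Here log_q is a hypothetical callable returning per-sample values of log q(y|x).

    import torch

    def club_upper_bound(x, y, log_q):
        # E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)], with the second expectation
        # estimated by pairing each x with a randomly shuffled y.
        positive = log_q(x, y).mean()
        y_shuffled = y[torch.randperm(y.size(0))]
        negative = log_q(x, y_shuffled).mean()
        return positive - negative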

2021-05-12

Contrastive Multiview Coding

Accepted to: ArXiv'19

Presenter: Qitong

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video unsupervised learning benchmarks.

2021-05-05

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

Accepted to: 4th Deep Learning in Medical Image Analysis (DLMIA) Workshop

Presenter: Tang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in low-dose CT scans of the chest, nuclei segmentation in microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

2021-04-21

Few-Shot Adversarial Domain Adaptation

Accepted to: NeurIPS 2017

Presenter: Pranjal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

This work provides a framework for addressing the problem of supervised domain adaptation with deep models. The main idea is to exploit adversarial learning to learn an embedded subspace that simultaneously maximizes the confusion between two domains while semantically aligning their embedding. The supervised setting becomes attractive especially when there are only a few target data samples that need to be labeled. In this few-shot learning scenario, alignment and separation of semantic probability distributions is difficult because of the lack of data. We found that by carefully designing a training scheme whereby the typical binary adversarial discriminator is augmented to distinguish between four different classes, it is possible to effectively address the supervised adaptation problem. In addition, the approach has a high “speed” of adaptation, i.e. it requires an extremely low number of labeled target training samples, even one per category can be effective. We then extensively compare this approach to the state of the art in domain adaptation in two experiments: one using datasets for handwritten digit recognition, and one using datasets for visual object recognition.

2021-04-14

Memory-augmented Dense Predictive Coding for Video Representation Learning

Accepted to: ECCV 2020

Presenter: Ziyang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: Coming soon...

The objective of this paper is self-supervised learning from video, in particular for representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) for the task. It is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing it to make multiple hypotheses efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance over other approaches while using orders of magnitude less training data.
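
The "convex combination of compressed memories" can be illustrated with a few lines of attention over a memory bank. This is a toy sketch with our own shapes and names, not the MemDPC implementation.

```python
import torch
import torch.nn.functional as F

memory = torch.randn(1024, 256)            # compressed memory bank (slots x dim)
query = torch.randn(8, 256)                # predicted query for the next time step

weights = F.softmax(query @ memory.t(), dim=-1)   # attention over memory slots
future = weights @ memory                         # convex combination of memories
print(future.shape, weights.sum(dim=-1))          # (8, 256); each row of weights sums to 1
```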

2021-04-07

VIBE: Video Inference for Human Body Pose and Shape Estimation

Accepted to: CVPR 2020

Presenter: Ruochen

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance.

2021-03-31

Perceiver: General Perception with Iterative Attention

Accepted to: arXiv 2021

Presenter: Meng

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
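
The asymmetric attention idea, a small learned latent array querying a very large input array, can be sketched roughly as below. This is an illustrative module with assumed sizes and names, not DeepMind's implementation; the cost scales with (latents x inputs) rather than (inputs squared).

```python
import torch
import torch.nn as nn

class CrossAttendBlock(nn.Module):
    def __init__(self, dim=256, num_latents=64, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, inputs):                     # inputs: (batch, n_inputs, dim)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        out, _ = self.attn(q, inputs, inputs)      # latents query the large byte array
        return out + self.ff(out)                  # (batch, num_latents, dim)

x = torch.randn(1, 10_000, 256)                    # e.g. 10k "pixels" as tokens
print(CrossAttendBlock()(x).shape)                 # torch.Size([1, 64, 256])
```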

2021-03-24

Sensor based Prediction of Human Driving Decisions using Feed forward Neural Networks for Intelligent Vehicles

Accepted to: International Conference on Intelligent Transportation Systems 2018

Presenter: Tanvir

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Prediction of human driving decisions is an important aspect of modeling human behavior for application to Advanced Driver Assistance Systems (ADAS) in intelligent vehicles. This paper presents a sensor-based receding horizon model for the prediction of human driving commands. Human driving decisions are expressed in terms of the vehicle speed and steering wheel angle profiles. Environmental state and human intention are the two major factors influencing human driving decisions. The environment around the vehicle is perceived using a LIDAR sensor. A feature extractor computes the occupancy grid map from the sensor data, which is filtered and processed to provide precise and relevant information to the feed-forward neural network. Human intentions can be identified from past driving decisions and represented in the form of time series data for the neural network. Supervised machine learning is used to train the neural network. Data collection and model validation are performed in the driving simulator using the SCANeR studio software. Simulation results are presented along with the analysis.

2021-03-17

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning

Accepted to: NeurIPS 2018

Presenter: Nate

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific 3D keypoints, along with their detectors. Given a single image, KeypointNet extracts 3D keypoints that are optimized for a downstream task. We demonstrate this framework on 3D pose estimation by proposing a differentiable objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object. Our model discovers geometrically and semantically consistent keypoints across viewing angles and instances of an object category. Importantly, we find that our end-to-end framework using no ground-truth keypoint annotations outperforms a fully supervised baseline using the same neural network architecture on the task of pose estimation. The discovered 3D keypoints on the car, chair, and plane categories of ShapeNet are visualized at keypointnet.github.io.
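
One ingredient that makes this kind of end-to-end keypoint discovery possible is a differentiable way to read keypoint coordinates off a heatmap. The snippet below sketches the standard expected-coordinate (soft-argmax) trick under our own naming; the paper's full objective is more involved.

```python
import torch

def expected_keypoints(heatmaps):
    """heatmaps: (batch, k, h, w) raw scores; returns (batch, k, 2) (x, y) in [-1, 1]."""
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w)
    x = (probs * xs).sum(dim=(2, 3))           # softmax-weighted mean column
    y = (probs * ys).sum(dim=(2, 3))           # softmax-weighted mean row
    return torch.stack([x, y], dim=-1)         # gradients flow back to the detector

print(expected_keypoints(torch.randn(2, 10, 32, 32)).shape)  # torch.Size([2, 10, 2])
```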

2021-03-10

Attention Is All You Need

Accepted to: NeurIPS 2017

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
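
At the heart of the Transformer is scaled dot-product attention. Below is a minimal toy version for reference; it is our own simplification, omitting masking, multiple heads, and the learned projections.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (..., n_q, d), k/v: (..., n_k, d); returns (..., n_q, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # similarity of queries to keys
    weights = F.softmax(scores, dim=-1)           # attention distribution per query
    return weights @ v                            # weighted sum of values

q = torch.randn(2, 5, 64)
k = v = torch.randn(2, 7, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```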

2020-12-16

OOPS! Predicting Unintentional Action in Video

Accepted to: CVPR 2020

Presenter: Tang Li

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

From just a short glance at a video, we can often tell whether a person’s action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly-supervised pre-training. However, a significant gap between machine and human performance remains.

2020-12-09

TLIO: Tight Learned Inertial Odometry

Accepted to: IEEE Robotics and Automation Letters 2020

Presenter: Nate

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

In this work we propose a tightly-coupled Extended Kalman Filter framework for IMU-only state estimation. Strap-down IMU measurements provide relative state estimates based on an IMU kinematic motion model. However, the integration of measurements is sensitive to sensor bias and noise, causing significant drift within seconds. Recent research by Yan et al. (RoNIN) and Chen et al. (IONet) showed the capability of using trained neural networks to obtain accurate 2D displacement estimates from segments of IMU data and obtained good position estimates by concatenating them. This paper demonstrates a network that regresses 3D displacement estimates and their uncertainty, giving us the ability to tightly fuse the relative state measurement into a stochastic cloning EKF to solve for pose, velocity and sensor biases. We show that our network, trained with pedestrian data from a headset, can produce statistically consistent measurements and uncertainties to be used as the update step in the filter, and that the tightly-coupled system outperforms velocity integration approaches in position estimates, and an AHRS attitude filter in orientation estimates.
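
The "tightly fuse the relative state measurement" step boils down to an EKF measurement update in which the network supplies both the 3D displacement and its covariance. The toy NumPy sketch below uses a simplified 6-D state and our own notation; the paper's stochastic-cloning filter is considerably richer.

```python
import numpy as np

def ekf_update(x, P, d_hat, R, H):
    """x: state mean, P: state covariance, d_hat: measured displacement,
    R: measurement covariance (from the network), H: measurement Jacobian."""
    y = d_hat - H @ x                       # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Toy example: 6-D state [position, velocity]; the measurement acts on position.
x = np.zeros(6); P = np.eye(6)
H = np.hstack([np.eye(3), np.zeros((3, 3))])
x, P = ekf_update(x, P, d_hat=np.array([0.1, 0.0, 0.02]), R=0.01 * np.eye(3), H=H)
print(x[:3])
```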

2020-11-18

Graph U-Nets

Accepted to: ICML 2019

Presenter: Pranjal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

We consider the problem of representation learning for graph data. Convolutional neural networks can naturally operate on images, but have significant challenges in dealing with graph data. Given that images are special cases of graphs with nodes lying on 2D lattices, graph embedding tasks have a natural correspondence with image pixel-wise prediction tasks such as segmentation. While encoder-decoder architectures like U-Nets have been successfully applied on many image pixel-wise prediction tasks, similar methods are lacking for graph data. This is due to the fact that pooling and up-sampling operations are not natural on graph data. To address these challenges, we propose novel graph pooling (gPool) and unpooling (gUnpool) operations in this work. The gPool layer adaptively selects some nodes to form a smaller graph based on their scalar projection values on a trainable projection vector. We further propose the gUnpool layer as the inverse operation of the gPool layer. The gUnpool layer restores the graph into its original structure using the position information of nodes selected in the corresponding gPool layer. Based on our proposed gPool and gUnpool layers, we develop an encoder-decoder model on graphs, known as the graph U-Nets. Our experimental results on node classification and graph classification tasks demonstrate that our methods achieve consistently better performance than previous models.
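
The gPool operation is compact enough to sketch directly: score nodes by their scalar projection onto a trainable vector, keep the top-k, and gate the kept features. The code below is our own condensed version; the exact gating nonlinearity in the paper may differ.

```python
import torch

def gpool(X, A, p, k):
    """X: (n, d) node features, A: (n, n) adjacency, p: (d,) trainable projection vector."""
    scores = (X @ p) / p.norm()                 # scalar projection per node
    idx = torch.topk(scores, k).indices         # nodes to keep
    X_pool = X[idx] * torch.sigmoid(scores[idx]).unsqueeze(-1)  # gated features
    A_pool = A[idx][:, idx]                     # induced subgraph
    return X_pool, A_pool, idx                  # idx lets gUnpool restore node positions

X = torch.randn(6, 4); A = torch.ones(6, 6); p = torch.randn(4)
Xp, Ap, idx = gpool(X, A, p, k=3)
print(Xp.shape, Ap.shape)                       # torch.Size([3, 4]) torch.Size([3, 3])
```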

2020-11-11

Occlusion Aware Unsupervised Learning of Optical Flow

Accepted to: CVPR 2018

Presenter: Ziyang Jia

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

It has been recently shown that a convolutional neural network can learn optical flow estimation with unsupervised learning. However, the performance of the unsupervised methods still has a relatively large gap compared to its supervised counterpart. Occlusion and large motion are some of the major factors that limit the current unsupervised learning of optical flow methods. In this work we introduce a new method which models occlusion explicitly and a new warping way that facilitates the learning of large motion. Our method shows promising results on the Flying Chairs, MPI-Sintel and KITTI benchmark datasets. Especially on the KITTI dataset, where abundant unlabeled samples exist, our unsupervised method outperforms its counterpart trained with supervised learning.

2020-11-04

Exploiting temporal information for 3D human pose estimation

Accepted to: ECCV 2018

Presenter: Ruochen Wang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict from images directly, the top-performing approaches have shown the effectiveness of dividing the task of 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from images and then mapping them into 3D space. They also showed that a low-dimensional representation like 2D locations of a set of joints can be discriminative enough to estimate 3D pose with high accuracy. However, estimation of 3D pose for individual frames leads to temporally incoherent estimates due to independent error in each frame causing jitter. Therefore, in this work we utilize the temporal information across a sequence of 2D joint locations to estimate a sequence of 3D poses. We designed a sequence-to-sequence network composed of layer-normalized LSTM units with shortcut connections connecting the input to the output on the decoder side and imposed a temporal smoothness constraint during training. We found that the knowledge of temporal consistency improves the best reported result on the Human3.6M dataset by approximately 12.2% and helps our network to recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.

2020-10-28

Multimodal Learning with Incomplete Modalities by Knowledge Distillation

Accepted to: KDD 2020

Presenter: Meng Ma

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Multimodal learning aims at utilizing information from a variety of data modalities to improve generalization performance. One common approach is to seek the common information that is shared among different modalities for learning, whereas we can also fuse the supplementary information to leverage modality-specific information. Though the supplementary information is often desired, most existing multimodal approaches can only learn from samples with complete modalities, which wastes a considerable amount of the data collected. Otherwise, model-based imputation needs to be used to complete the missing values, which may introduce undesired noise, especially when the sample size is limited. In this paper, we propose a framework based on knowledge distillation, utilizing the supplementary information from all modalities and avoiding the imputation and noise associated with it. Specifically, we first train models on each modality independently using all the available data. Then the trained models are used as teachers to teach the student model, which is trained with the samples having complete modalities. We demonstrate the effectiveness of the proposed method in extensive empirical studies on both synthetic datasets and real-world datasets.
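
The distillation step can be sketched as a weighted sum of a hard-label loss and soft-target losses from the unimodal teachers. The loss below is a generic knowledge-distillation sketch with assumed temperature and weighting, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)          # supervised term
    soft = sum(
        F.kl_div(F.log_softmax(student_logits / T, dim=1),  # student soft predictions
                 F.softmax(t / T, dim=1),                   # one teacher's soft targets
                 reduction="batchmean") * T * T
        for t in teacher_logits_list
    ) / len(teacher_logits_list)
    return alpha * hard + (1 - alpha) * soft

student = torch.randn(4, 10)
teachers = [torch.randn(4, 10), torch.randn(4, 10)]   # e.g. one teacher per modality
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teachers, labels).item())
```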

2020-10-21

Introduction to Self-Supervised Representation Learning

Accepted to: tutorial

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Given a task and enough labels, supervised learning can solve it really well. Good performance usually requires a decent amount of labels, but collecting manual labels is expensive (e.g., ImageNet) and hard to scale up. Considering that the amount of unlabelled data (e.g., free text, all the images on the Internet) is substantially larger than the limited number of human-curated labelled datasets, it seems rather wasteful not to use it. However, unsupervised learning is not easy and usually works much less efficiently than supervised learning. What if we could get labels for free for unlabelled data and train on it in a supervised manner? We can achieve this by framing a supervised learning task in a special form that predicts only a subset of the information using the rest. In this way, all the information needed, both inputs and labels, has been provided. This is known as self-supervised learning.
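
A concrete example of such "free labels" is rotation prediction: rotate each unlabeled image by 0/90/180/270 degrees and ask the network which rotation was applied. The snippet below is a toy illustration of this pretext-task pattern (one of several common ones; it is not taken from the slides).

```python
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """images: (n, c, h, w). Returns rotated copies and the rotation index as the label."""
    rotated, labels = [], []
    for k in range(4):                                   # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

images = torch.randn(8, 3, 32, 32)                       # unlabeled images
x, y = make_rotation_batch(images)                       # the labels came for free
head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))
loss = nn.CrossEntropyLoss()(head(x), y)                 # ordinary supervised loss
print(x.shape, loss.item())
```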

2020-10-14

SinGAN: Learning a Generative Model from a Single Natural Image

Accepted to: ICCV 2019 (Oral)

Presenter: Tang Li

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.

2020-10-07

CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM

Accepted to: CVPR 2018

Presenter: Nate

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

The representation of geometry in real-time 3D perception systems continues to be a critical research issue. Dense maps capture complete surface shape and can be augmented with semantic labels, but their high dimensionality makes them computationally costly to store and process, and unsuitable for rigorous probabilistic inference. Sparse feature-based representations avoid these problems, but capture only partial scene information and are mainly useful for localisation only. We present a new compact but dense representation of scene geometry which is conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters. We are inspired by work both on learned depth from images, and auto-encoders. Our approach is suitable for use in a keyframe-based monocular dense SLAM system: While each keyframe with a code can produce a depth map, the code can be optimised efficiently jointly with pose variables and together with the codes of overlapping keyframes to attain global consistency. Conditioning the depth map on the image allows the code to only represent aspects of the local geometry which cannot directly be predicted from the image. We explain how to learn our code representation, and demonstrate its advantageous properties in monocular SLAM.

2020-09-30

Unsupervised Domain Adaptation for 3D Human Pose Estimation

Accepted to: ACM MM 2019

Presenter: Hamed

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data which is inefficient and costly to collect. Previous methods have either resorted to weakly supervised methods to reduce the demand of ground-truth data for training, or using synthetically-generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require either additional supervision, such as unpaired 3D ground-truth data, or the camera parameters in multiview settings. On the other hand, the latter methods require accurately textured models, illumination configurations and background which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which aims at leveraging the knowledge in multi-modality data of the easy-to-get synthetic depth datasets to better train a pose estimator on the real-world datasets. Specifically, the framework first trains two pose estimators on synthetically-generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to unsupervisedly learn a new RGB image based 2D/3D human pose estimator. Here, the knowledge encoded in the supervised learning modules are used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term, as well as with different amount of supervisions signal.

2020-09-23

Introduction to Graph Convolution Networks

Accepted to: None

Presenter: Pranjal Dhakal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

This talk is a tutorial-style introduction to graph convolutional networks.

Code and notebooks are in this GitHub repo.

2020-09-16

Robust Learning Through Cross-Task Consistency

Accepted to: CVPR 2020 (Oral)

Presenter: Ziyang Jia

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Visual perception entails solving a wide set of tasks, e.g., object detection, depth estimation, etc. The predictions made for multiple tasks from the same image are not independent, and therefore, are expected to be ‘consistent’. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path invariance over a graph of arbitrary tasks. We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs. This framework also leads to an informative unsupervised quantity, called Consistency Energy, based on measuring the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), thus it can be employed as an unsupervised confidence metric as well as for detection of out-of-distribution inputs (ROC-AUC=0.95). The evaluations are performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape, and they benchmark cross-task consistency versus various baselines including conventional multi-task learning, cycle consistency, and analytical consistency.
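
A simplified way to picture Consistency Energy is as the disagreement between predictions of the same target reached through different inference paths. The toy code below uses our own simplified formula, not the paper's exact definition.

```python
import torch

def consistency_energy(predictions):
    """predictions: list of tensors, each (h, w), same target via different inference paths."""
    preds = torch.stack(predictions)
    mean = preds.mean(dim=0, keepdim=True)
    return (preds - mean).abs().mean()          # high energy = the paths disagree

direct = torch.rand(64, 64)                          # e.g. image -> normals directly
via_depth = direct + 0.05 * torch.randn(64, 64)      # e.g. image -> depth -> normals
print(consistency_energy([direct, via_depth]).item())
```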

2020-09-09

Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation

Accepted to: CVPR 2020 (Poster)

Presenter: Ruochen Wang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: We present a lightweight solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. Building upon recent advances in interpretable representation learning, we exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. This allows us to reason effectively about 3D pose across different views without using compute-intensive volumetric grids. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections, that can be simply lifted to 3D via a differentiable Direct Linear Transform (DLT) layer. In order to do it efficiently, we propose a novel implementation of DLT that is orders of magnitude faster on GPU architectures than standard SVD-based triangulation methods. We evaluate our approach on two large-scale human pose datasets (H36M and Total Capture): our method outperforms or performs comparably to the state-of-the-art volumetric methods, while, unlike them, yielding real-time performance.
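
For background, the classic SVD-based Direct Linear Transform that lifts per-view 2D detections to a 3D point looks as follows; the paper's contribution is a faster, GPU-friendly solver that avoids the SVD, which this sketch does not reproduce.

```python
import numpy as np

def dlt_triangulate(points_2d, projections):
    """points_2d: list of (u, v); projections: list of 3x4 camera matrices."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])      # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                            # null-space direction of A
    return X[:3] / X[3]                   # dehomogenize

# Two toy cameras looking at the 3D point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[0.5], [0.0], [0.0]])])
X = np.array([0.0, 0.0, 5.0, 1.0])
uv = [tuple((P @ X)[:2] / (P @ X)[2]) for P in (P1, P2)]
print(dlt_triangulate(uv, [P1, P2]))      # approximately [0, 0, 5]
```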

2020-08-26

Towards Visually Explaining Variational Autoencoders

Accepted to: CVPR 2020 (Oral)

Presenter: Yi Liu

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g. variational autoencoders (VAE) is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the Dsprites dataset.

2020-08-26

CPM-Nets: Cross Partial Multi-View Networks

Accepted to: NeurIPS 2019 (Poster)

Presenter: Meng Ma

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Although multi-view learning has progressed fast in past decades, it is still challenging due to the difficulty in modeling complex correlations among different views, especially under the context of view missing. To address the challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets). In this framework, we first give a formal definition of completeness and versatility for multi-view representation and then theoretically prove the versatility of the latent representation learned from our algorithm. To achieve completeness, the task of learning the latent multi-view representation is specifically translated to a degradation process by mimicking data transmission, such that the optimal tradeoff between consistency and complementarity across different views can be achieved. In contrast with methods that either complete missing views or group samples according to view-missing patterns, our model fully exploits all samples and all views to produce a structured representation for interpretability. Extensive experimental results validate the effectiveness of our algorithm over existing state-of-the-arts.

2020-08-19

Bilevel Continual Learning

Accepted to: ArXiv 2020

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Continual learning aims to learn continuously from a stream of tasks and data in an online-learning fashion, being capable of exploiting what was learned previously to improve current and future tasks while still being able to perform well on the previous tasks. One common limitation of many existing continual learning methods is that they often train a model directly on all available training data without validation due to the nature of continual learning, thus suffering poor generalization at test time. In this work, we present a novel framework of continual learning named "Bilevel Continual Learning" (BCL) by unifying a {\it bilevel optimization} objective and a {\it dual memory management} strategy comprising both episodic memory and generalization memory to achieve effective knowledge transfer to future tasks and alleviate catastrophic forgetting on old tasks simultaneously. Our extensive experiments on continual learning benchmarks demonstrate the efficacy of the proposed BCL compared to many state-of-the-art methods. Our implementation is available at https://github.com/phquang/bilevel-continual-learning.

2020-08-12

Vision-Based Fall Detection with Convolutional Neural Networks

Accepted to: Wireless Communications and Mobile Computing 2017

Presenter: Hamed Fayyaz

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: One of the biggest challenges in modern societies is the improvement of healthy aging and the support to older persons in their daily activities. In particular, given its social and economic impact, the automatic detection of falls has attracted considerable attention in the computer vision and pattern recognition communities. Although the approaches based on wearable sensors have provided high detection rates, some of the potential users are reluctant to wear them and thus their use is not yet normalized. As a consequence, alternative approaches such as vision-based methods have emerged. We firmly believe that the irruption of the Smart Environments and the Internet of Things paradigms, together with the increasing number of cameras in our daily environment, forms an optimal context for vision-based systems. Consequently, here we propose a vision-based solution using Convolutional Neural Networks to decide if a sequence of frames contains a person falling. To model the video motion and make the system scenario independent, we use optical flow images as input to the networks followed by a novel three-step training phase. Furthermore, our method is evaluated in three public datasets achieving the state-of-the-art results in all three of them.

2020-08-05

Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction

Accepted to: CVPR 2020 (Poster)

Presenter: Pranjal Dhakal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Better machine understanding of pedestrian behaviors enables faster progress in modeling interactions between agents such as autonomous vehicles and humans. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects. Previous methods modeled these interactions by using a variety of aggregation methods that integrate different learned pedestrian states. We propose the Social Spatio-Temporal Graph Convolutional Neural Network (Social-STGCNN), which substitutes the need for aggregation methods by modeling the interactions as a graph. Our results show an improvement over the state of the art by 20% on the Final Displacement Error (FDE) and an improvement on the Average Displacement Error (ADE) with 8.5 times fewer parameters and up to 48 times faster inference speed than previously reported methods. In addition, our model is data efficient, and exceeds the previous state of the art on the ADE metric with only 20% of the training data. We propose a kernel function to embed the social interactions between pedestrians within the adjacency matrix. Through qualitative analysis, we show that our model inherits social behaviors that can be expected between pedestrian trajectories. Code is available at https://github.com/abduallahmohamed/Social-STGCNN.
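
The kernel-weighted adjacency can be illustrated with an inverse-distance kernel: nearby pedestrians get stronger edges, and the matrix is then symmetrically normalized for graph convolution. This is a rough sketch with our own normalization choice, not the released code.

```python
import numpy as np

def inverse_distance_adjacency(positions, eps=1e-6):
    """positions: (n, 2) pedestrian locations at one time step."""
    n = len(positions)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i, j] = 1.0 / (np.linalg.norm(positions[i] - positions[j]) + eps)
    # Symmetric normalization, as commonly used with graph convolutions.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1) + eps))
    return D_inv_sqrt @ A @ D_inv_sqrt

positions = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(inverse_distance_adjacency(positions).round(3))
```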

2020-07-29

4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras

Accepted to: CVPR 2020 (Oral)

Presenter: Ruochen Wang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: This paper contributes a novel realtime multi-person motion capture algorithm using multiview video inputs. Due to the heavy occlusions in each view, joint optimization on the multiview images and multiple temporal frames is indispensable, which brings up the essential challenge of realtime efficiency. To this end, for the first time, we unify per-view parsing, cross-view matching, and temporal tracking into a single optimization framework, i.e., a 4D association graph in which each dimension (image space, viewpoint and time) can be treated equally and simultaneously. To solve the 4D association graph efficiently, we further contribute the idea of 4D limb bundle parsing based on heuristic searching, followed by limb bundle assembling via a proposed bundle Kruskal's algorithm. Our method enables a realtime online motion capture system running at 30 fps using 5 cameras on a 5-person scene. Benefiting from the unified parsing, matching and tracking constraints, our method is robust to noisy detection, and achieves high-quality online pose reconstruction. The proposed method outperforms the state-of-the-art method quantitatively without using high-level appearance information. We also contribute a multiview video dataset synchronized with a marker-based motion capture system for scientific evaluation.

2020-07-22

Sampling-free Epistemic Uncertainty Estimation Using Approximated Variance Propagation

Accepted to: ICCV 2019 (Oral)

Presenter: Ziyang Jia

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: We present a sampling-free approach for computing the epistemic uncertainty of a neural network. Epistemic uncertainty is an important quantity for the deployment of deep neural networks in safety-critical applications, since it represents how much one can trust predictions on new data. Recently promising works were proposed using noise injection combined with Monte-Carlo sampling at inference time to estimate this quantity (e.g. Monte-Carlo dropout). Our main contribution is an approximation of the epistemic uncertainty estimated by these methods that does not require sampling, thus notably reducing the computational overhead. We apply our approach to large-scale visual tasks (i.e., semantic segmentation and depth regression) to demonstrate the advantages of our method compared to sampling-based approaches in terms of quality of the uncertainty estimates as well as of computational overhead.
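
The sampling-free idea rests on propagating moments analytically. For a single linear layer with independent input noise, the output variance has a closed form, as the toy sketch below shows (the paper extends this through full networks and non-linearities; the Monte-Carlo check is only for illustration).

```python
import numpy as np

def propagate_linear(mean, var, W, b):
    """mean, var: (d_in,) input mean/variance; W: (d_out, d_in); b: (d_out,)."""
    out_mean = W @ mean + b
    out_var = (W ** 2) @ var        # Var(sum_i w_i x_i) = sum_i w_i^2 Var(x_i)
    return out_mean, out_var

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 8)), np.zeros(4)
mean, var = rng.normal(size=8), 0.1 * np.ones(8)       # e.g. noise injected by dropout
analytic_mean, analytic_var = propagate_linear(mean, var, W, b)

# Monte-Carlo check: sample noisy inputs, push them through the layer, compare variances.
samples = mean + np.sqrt(var) * rng.normal(size=(100_000, 8))
mc_var = (samples @ W.T + b).var(axis=0)
print(np.round(analytic_var, 3), np.round(mc_var, 3))   # should be close
```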

2020-07-15

Weight Agnostic Neural Networks

Accepted to: NeurIPS 2019 (Poster)

Presenter: Zhenzhu Zheng

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Not all neural network architectures are created equal, some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much higher than chance accuracy on MNIST using random weights. Interactive version of this paper at https://weightagnostic.github.io
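
The evaluation protocol, scoring one fixed architecture over a sweep of a single shared weight, can be mimicked on a toy task as below. The network topology and task here are our own stand-ins, not the authors' search space.

```python
import numpy as np

def forward(x, shared_w):
    """A tiny fixed topology in which every connection uses the same weight."""
    h = np.tanh(shared_w * x)              # layer 1
    return np.tanh(shared_w * h.sum())     # layer 2, scalar output

def score(shared_w):
    # Toy task: the output should have the same sign as the sum of the inputs.
    rng = np.random.default_rng(0)
    xs = rng.normal(size=(200, 5))
    preds = np.array([forward(x, shared_w) for x in xs])
    return np.mean(np.sign(preds) == np.sign(xs.sum(axis=1)))

for w in (-2.0, -1.0, 1.0, 2.0):           # sample shared weights, report mean performance
    print(w, score(w))
```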

2020-07-08

Deep InfoMax: Learning deep representations by mutual information estimation and maximization

Accepted to: ICLR 2019 (Oral)

Presenter: Yi Liu

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: This work investigates unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks with some standard architectures. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation learning objectives for specific end-goals.

2020-07-01

Tracking by Instance Detection: A Meta-Learning Approach

Accepted to: CVPR 2020 (Oral)

Presenter: Meng Ma

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: We consider the tracking problem as a special type of object detection problem, which we call instance detection. With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image. We find that model-agnostic meta-learning (MAML) offers a strategy to initialize the detector that satisfies our needs. We propose a principled three-step approach to build a high-performance tracker. First, pick any modern object detector trained with gradient descent. Second, conduct offline training (or initialization) with MAML. Third, perform domain adaptation using the initial frame. We follow this procedure to build two trackers, named Retina-MAML and FCOS-MAML, based on two modern detectors RetinaNet and FCOS. Evaluations on four benchmarks show that both trackers are competitive against state-of-the-art trackers. On OTB-100, Retina-MAML achieves the highest ever AUC of 0.712. On TrackingNet, FCOS-MAML ranks the first on the leader board with an AUC of 0.757 and the normalized precision of 0.822. Both trackers run in real-time at 40 FPS.
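
The MAML-style initialization referred to above can be sketched on a toy 1-D regression: an inner gradient step adapts to one task, and the outer loss trains the initialization so that this single step works well. The code is a generic MAML sketch (our own task, hyperparameters, and support/query split), not the trackers' training code.

```python
import torch

w = torch.zeros(1, requires_grad=True)               # shared initialization
outer_opt = torch.optim.SGD([w], lr=1e-2)

for step in range(200):
    outer_opt.zero_grad()
    for slope in (1.0, 2.0, 3.0):                     # each "instance"/task: y = slope * x
        x_s, x_q = torch.randn(16, 1), torch.randn(16, 1)
        y_s, y_q = slope * x_s, slope * x_q
        inner_loss = ((x_s * w - y_s) ** 2).mean()    # adapt on the support set
        (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_adapted = w - 0.1 * g                       # one inner gradient step
        outer_loss = ((x_q * w_adapted - y_q) ** 2).mean()
        outer_loss.backward()                         # gradient flows through the inner step
    outer_opt.step()

print(w.item())                                       # an initialization that adapts quickly
```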

2020-06-24

Open Compound Domain Adaptation

Accepted to: CVPR 2020 (Oral)

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: A typical domain adaptation approach is to adapt models trained on the annotated data in a source domain (e.g., sunny weather) for achieving high performance on the test data in a target domain (e.g., rainy weather). Whether the target contains a single homogeneous domain or multiple heterogeneous domains, existing works always assume that there exist clear distinctions between the domains, which is often not true in practice (e.g., changes in weather). We study an open compound domain adaptation (OCDA) problem, where the target is a compound of multiple homogeneous domains without domain labels, reflecting realistic data collection from mixed and novel situations. We propose a new approach based on two technical insights into OCDA: 1) a curriculum domain adaptation strategy to bootstrap generalization across domains in a data-driven self-organizing fashion and 2) a memory module to increase the model’s agility towards novel domains. Our experiments on digit classification, facial expression recognition, semantic segmentation, and reinforcement learning demonstrate the effectiveness of our approach.

About Reading Group

The reading group is held weekly by D-REAL at University of Delaware. The goal is to broaden the scope of research interest in Machine Learning, Deep Learning, and Computer Vision by sharing and discussing high-quality papers.

Schedule at Winter & Spring 2024

  • Kien Nguyen: Jan. 23rd
  • Tang Li: Feb. 8th
  • Fengchun Qiao: Feb. 27th
  • Ricardo Santos: Mar. 12th
  • Tang Li: Mar. 26th
  • Jeffrey Peng: Apr. 2nd
  • Qitong Wang: Apr. 9th
  • Fengchun Qiao: Apr. 23rd
  • Meng Ma: Apr. 30th

Contact

  • Person: Qitong Wang
  • Email: wqtwjt at udel dot edu