2024-04-23

LOG: Active Model Adaptation for Label-Efficient OOD Generalization

Accepted by: NeurIPS-2022

Presenter: Fengchun Qiao

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

This work discusses how to achieve worst-case Out-Of-Distribution (OOD) generalization for a variety of distributions at a relatively small labeling cost. The problem has broad applications, especially in non-i.i.d. open-world scenarios. Previous studies either rely on a large amount of labeling or lack guarantees on worst-case generalization. In this work, we show for the first time that active model adaptation can achieve both good performance and robustness based on the invariant risk minimization principle. We propose LOG, an interactive model adaptation framework with two sub-modules: active sample selection and causal invariant learning. Specifically, we formulate active selection as a mixture distribution separation problem and present an unbiased estimator, which can find the samples that violate the current invariant relationship, with a provable guarantee. The theoretical analysis shows that both sub-modules contribute to generalization, and extensive experimental results confirm the promising performance of the new algorithm.

2024-04-09

A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering

Submitted to: arXiv-2024

Presenter: Qitong Wang

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA). Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate not just recognition of visual elements, but also a deep comprehension of the visual information in conjunction with a vast repository of learned knowledge. To uncover such capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect to general knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images, showcasing its proficiency across various specialized fields; 3) Comprehensive Knowledge with Decision-making Rationales, which examines the model's capability to provide logical explanations for its inference, facilitating a deeper analysis from the interpretability perspective. Extensive experiments indicate that GPT-4V achieves SOTA performance on the above three tasks. Interestingly, we find that: a) GPT-4V demonstrates enhanced reasoning and explanation when using composite images as few-shot examples; b) GPT-4V produces severe hallucinations when dealing with world knowledge, highlighting the need for future advancements in this research direction.

2024-04-02

Improving LMMs from the Data Perspective

Presenter: Jeffrey Peng

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

Related Paper 1: Submitted to arXiv, 2024.

Related Paper 2: Accepted by CVPR, 2024.

While LISA effectively bridges the gap between segmentation and large language models to enable reasoning segmentation, it has certain limitations: it cannot distinguish different instances of the target region and is constrained by pre-defined textual response formats. In this work, we introduce LISA++, an update to the existing LISA model, focusing on improving core functionalities while keeping the base architecture intact. The main enhancements in LISA++ include: 1) Enhanced Segmentation: instance segmentation ability has been added, providing more detailed scene analysis along with the existing multi-region semantic segmentation. 2) More Natural Conversation: improved capability for multi-turn dialogue, with the ability to incorporate segmentation results directly into text responses, i.e., Segmentation in Dialogue (SiD). These improvements are achieved by curating existing samples of generic segmentation datasets, aimed specifically at enhancing the segmentation and conversational skills without structural changes or additional data sources. Comparative analysis with the original LISA model shows significant advancements in these areas, positioning LISA++ as a notable upgrade in visual understanding and interaction. LISA++'s adaptability and improved features highlight the versatility of the mask-as-embedding paradigm proposed by LISA and its potential as a foundational model for diverse applications.
Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches, but under false premise conditions produces relative cIOU improvements of more than 31% over baselines, and produces natural language feedback judged helpful up to 67% of the time.

2024-03-26

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Accepted by: ICLR-2024 (Spotlight)

Presenter: Tang Li

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Code: https://github.com/wusize/CLIPSelf

Slides: link

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of the CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
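
For intuition, here is a minimal sketch of a CLIPSelf-style self-distillation loss; the pooling choice, tensor shapes, and function name are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clipself_loss(dense_features, crop_embeddings, boxes):
    """Sketch of a CLIPSelf-style self-distillation objective (our simplification):
    a region embedding pooled from the ViT's dense feature map is pulled toward
    the image-level CLIP embedding of the corresponding image crop.
    dense_features: (C, H, W); boxes: (x0, y0, x1, y1) in feature-grid coordinates."""
    losses = []
    for (x0, y0, x1, y1), crop_emb in zip(boxes, crop_embeddings):
        region = dense_features[:, y0:y1, x0:x1].mean(dim=(1, 2))  # pooled region embedding
        losses.append(1.0 - F.cosine_similarity(region, crop_emb, dim=0))
    return torch.stack(losses).mean()

# Toy shapes: a 512-d dense feature map on a 14x14 grid and two "crop" embeddings.
dense = torch.randn(512, 14, 14, requires_grad=True)
crops = torch.randn(2, 512)  # stand-ins for image-level CLIP embeddings of the crops
loss = clipself_loss(dense, crops, boxes=[(0, 0, 7, 7), (7, 7, 14, 14)])
loss.backward()
```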

2024-03-12

Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

Accepted by: npj Digital Medicine-2021

Presenter: Ricardo Santos

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

Deep learning (DL)-based predictive models from electronic health records (EHRs) deliver impressive performance in many clinical tasks. Large training cohorts, however, are often required by these models to achieve high accuracy, hindering the adoption of DL-based models in scenarios with limited training data. Recently, bidirectional encoder representations from transformers (BERT) and related models have achieved tremendous successes in the natural language processing domain. The pretraining of BERT on a very large training corpus generates contextualized embeddings that can boost the performance of models trained on smaller datasets. Inspired by BERT, we propose Med-BERT, which adapts the BERT framework originally developed for the text domain to the structured EHR domain. Med-BERT is a contextualized embedding model pretrained on a structured EHR dataset of 28,490,650 patients. Fine-tuning experiments showed that Med-BERT substantially improves the prediction accuracy, boosting the area under the receiver operating characteristic curve (AUC) by 1.21–6.14% in two disease prediction tasks from two clinical databases. In particular, pretrained Med-BERT obtains promising performance on tasks with small fine-tuning training sets and can boost the AUC by more than 20%, or obtain an AUC as high as a model trained on a training set ten times larger, compared with deep learning models without Med-BERT. We believe that Med-BERT will benefit disease prediction studies with small local training datasets, reduce data collection expenses, and accelerate the pace of artificial intelligence-aided healthcare.

2024-02-27

Combining Diverse Feature Priors

Accepted by: ICML-2022

Presenter: Fengchun Qiao

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/99166091629

Slides: link

To improve model generalization, model designers often restrict the features that their models use, either implicitly or explicitly. In this work, we explore the design space of leveraging such feature priors by viewing them as distinct perspectives on the data. Specifically, we find that models trained with diverse sets of explicit feature priors have less overlapping failure modes, and can thus be combined more effectively. Moreover, we demonstrate that jointly training such models on additional (unlabeled) data allows them to correct each other’s mistakes, which, in turn, leads to better generalization and resilience to spurious correlations.

2024-02-08

Interpreting CLIP's Image Representation via Text-Based Decomposition

Accepted by: ICLR-2024 (Oral)

Presenter: Tang Li

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/99817644758

Project Page: https://yossigandelsman.github.io/clip_decomposition/

Code: https://github.com/yossigandelsman/clip_prs

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g., location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that scalable understanding of transformer models is attainable and can be used to repair and improve models.

2024-01-23

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Submitted to: arXiv-2021

Presenter: Kien Nguyen

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/97743603309

Project Page: https://people.eecs.berkeley.edu/~angelopoulos/blog/posts/gentle-intro/

Code: https://github.com/aangelopoulos/conformal-prediction

Black-box machine learning models are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Conformal prediction is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions. One can use conformal prediction with any pre-trained model, such as a neural network, to produce sets that are guaranteed to contain the ground truth with a user-specified probability, such as 90%. It is easy to understand, easy to use, and general, applying naturally to problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is intended to provide the reader with a working understanding of conformal prediction and related distribution-free uncertainty quantification techniques in one self-contained document. We lead the reader through practical theory for and examples of conformal prediction and describe its extensions to complex machine learning tasks involving structured outputs, distribution shift, time series, outliers, models that abstain, and more. Throughout, there are many explanatory illustrations, examples, and code samples in Python. With each code sample comes a Jupyter notebook implementing the method on a real-data example; the notebooks can be accessed and easily run using our codebase.
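
As a concrete illustration of the split-conformal recipe the tutorial covers, below is a small sketch (our own, not taken from the accompanying notebooks); the nonconformity score and variable names are assumptions.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split-conformal prediction sets that contain the true label with
    probability >= 1 - alpha, assuming exchangeable calibration/test data."""
    n = len(cal_labels)
    # Nonconformity score: 1 - softmax probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, q_level, method="higher")
    # Keep every class whose score falls below the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

# Toy usage with random "model" probabilities.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=500)
cal_labels = rng.integers(0, 5, size=500)
test_probs = rng.dirichlet(np.ones(5), size=3)
print(conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1))
```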

2023-12-12

FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding

Accepted by: CVPR-2020

Presenter: Ziyang Jia

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/98953560317

Slides: link

On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g., sports analysis, which require the capability of parsing an activity into phases and differentiating between subtly different actions, their performance remains far from satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g., how to parse the temporal structures from a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset can advance research towards action understanding.

2023-11-28

Explain Any Concept: Segment Anything Meets Concept-Based Explanation

Accepted by: NeurIPS-2023

Presenter: Qitong Wang

Time: 4:00-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/98953560317

Slides: link

Code: https://github.com/Jerry00917/samshap/tree/main

EXplainable AI (XAI) is an essential topic to improve human understanding of deep neural networks (DNNs) given their black-box internals. For computer vision tasks, mainstream pixel-based XAI methods explain DNN decisions by identifying important pixels, while emerging concept-based XAI methods explore forming explanations with concepts (e.g., a head in an image). However, pixels are generally hard to interpret and sensitive to the imprecision of XAI methods, whereas "concepts" in prior works require human annotation or are limited to pre-defined concept sets. On the other hand, driven by large-scale pre-training, the Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework for performing precise and comprehensive instance segmentation, enabling automatic preparation of concept sets from a given image. This paper for the first time explores using SAM to augment concept-based XAI. We offer an effective and flexible concept-based explanation method, namely Explain Any Concept (EAC), which explains DNN decisions with any concept. While SAM is highly effective and offers an "out-of-the-box" instance segmentation, it is costly when being integrated into de facto XAI pipelines. We thus propose a lightweight per-input equivalent (PIE) scheme, enabling efficient explanation with a surrogate model. Our evaluation over two popular datasets (ImageNet and COCO) illustrates the highly encouraging performance of EAC over commonly-used XAI methods.

2023-10-24

Explaining machine learning models with interactive natural language conversations using TalkToModel

Accepted by: Nature Machine Intelligence-2023

Presenter: Meng Ma

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Practitioners increasingly use machine learning (ML) models, yet models have become more complex and harder to understand. To understand complex models, researchers have proposed techniques to explain model predictions. However, practitioners struggle to use explainability methods because they do not know which explanation to choose and how to interpret the explanation. Here we address the challenge of using explainability methods by proposing TalkToModel: an interactive dialogue system that explains ML models through natural language conversations. TalkToModel consists of three components: an adaptive dialogue engine that interprets natural language and generates meaningful responses; an execution component that constructs the explanations used in the conversation; and a conversational interface. In real-world evaluations, 73% of healthcare workers agreed they would use TalkToModel over existing systems for understanding a disease prediction model, and 85% of ML professionals agreed TalkToModel was easier to use, demonstrating that TalkToModel is highly effective for model explainability.

2023-10-17

Resolving Interference When Merging Models

Accepted by: NeurIPS-2023

Presenter: Fengchun Qiao

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Slides: link

Transfer learning - i.e., further fine-tuning a pre-trained model on a downstream task - can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model merging techniques have emerged as a solution to combine multiple task-specific models into a single multitask model without performing additional training. However, existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. In this paper, we demonstrate that prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter's values across models. To address this, we propose our method, TrIm, Elect Sign & Merge (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign. We find that TIES-Merging outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters and highlight the importance of resolving sign interference.
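
The three steps lend themselves to a compact sketch. The toy implementation below operates on flat task vectors (fine-tuned minus pre-trained weights) and is our own simplification of trim / elect sign / merge, not the authors' released code.

```python
import numpy as np

def ties_merge(task_vectors, keep_frac=0.2):
    """Toy TIES-style merge: trim small updates, elect a per-parameter sign,
    then average only the entries that agree with the elected sign."""
    trimmed = []
    for tv in task_vectors:
        tv = tv.copy()
        cutoff = np.quantile(np.abs(tv), 1.0 - keep_frac)
        tv[np.abs(tv) < cutoff] = 0.0          # 1) trim small-magnitude changes
        trimmed.append(tv)
    stacked = np.stack(trimmed)
    elected = np.sign(stacked.sum(axis=0))     # 2) elect the dominant sign per parameter
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts  # 3) merge agreeing entries

task_vectors = [np.random.randn(10) for _ in range(3)]
print(ties_merge(task_vectors))
```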

2023-10-10

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Accepted by: ICCV-2023

Presenter: Kien Nguyen

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Slides: link

Project Page: https://whoops-benchmark.github.io/

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities.

2023-09-19

GFPose: Learning 3D Human Pose Prior with Gradient Fields

Accepted by: CVPR-2023

Presenter: Ziyang Jia

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Project Page: https://sites.google.com/view/gfpose/

Learning a 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite its simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on the Human3.6M dataset; 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone; 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion, and generation tasks.

2023-09-12

Disentangling visual and written concepts in CLIP

Accepted by: CVPR-2022

Presenter: Tang Li

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

The CLIP network measures the similarity between natural text and images; in this work, we investigate the entanglement of the representation of word images and natural images in its image encoder. First, we find that the image encoder has an ability to match word images with natural images of scenes described by those words. This is consistent with previous research that suggests that the meaning and the spelling of a word might be entangled deep within the network. On the other hand, we also find that CLIP has a strong ability to match nonsense words, suggesting that processing of letters is separated from processing of their meaning. To explicitly determine whether the spelling capability of CLIP is separable, we devise a procedure for identifying representation subspaces that selectively isolate or eliminate spelling capabilities. We benchmark our methods against a range of retrieval tasks, and we also test them by measuring the appearance of text in CLIP-guided generated images. We find that our methods are able to cleanly separate spelling capabilities of CLIP from the visual processing of natural images.

2023-09-05

X-Pruner: eXplainable Pruning for Vision Transformers

Accepted by: CVPR-2023

Presenter: Qitong Wang

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Slides: link

Recently, vision transformer models have become prominent for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, overlooking the relationship between internal units of the model and the target class and thereby leading to inferior performance. To alleviate this problem, we propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion. Specifically, to measure each prunable unit's contribution to predicting each target class, a novel explainability-aware mask is proposed and learned in an end-to-end manner. Then, to preserve the most informative units and learn the layer-wise pruning rate, we adaptively search the layer-wise threshold that differentiates between unpruned and pruned units based on their explainability-aware mask values. To verify and evaluate our method, we apply X-Pruner to representative transformer models including DeiT and the Swin Transformer. Comprehensive simulation results demonstrate that the proposed X-Pruner outperforms the state-of-the-art black-box methods with significantly reduced computational costs and slight performance degradation.

2023-08-29

Personalized Federated Learning with Inferred Collaboration Graphs

Accepted by: ICML-2023

Presenter: Meng Ma

Time: 4:00-6:00 p.m. EDT

Zoom: https://udel.zoom.us/j/91828932942

Code: https://github.com/MediaBrain-SJTU/pFedGraph

Personalized federated learning (FL) aims to collaboratively train a personalized model for each client. Previous methods do not adaptively determine whom to collaborate with at a fine-grained level, making it difficult for them to handle diverse levels of data heterogeneity and cases where malicious clients exist. To address this issue, our core idea is to learn a collaboration graph, which models the benefits from each pairwise collaboration and allocates appropriate collaboration strengths. Based on this, we propose a novel personalized FL algorithm, pFedGraph, which consists of two key modules: (1) inferring the collaboration graph based on pairwise model similarity and dataset size at the server to promote fine-grained collaboration, and (2) optimizing the local model with the assistance of the aggregated model at the client to promote personalization. The advantage of pFedGraph is that it flexibly adapts to diverse data heterogeneity levels and model poisoning attacks, as the proposed collaboration graph always pushes each client to collaborate more with similar and beneficial clients. Extensive experiments show that pFedGraph consistently outperforms the other baseline methods across various heterogeneity levels and multiple cases where malicious clients exist.

2023-08-22

"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts

Accepted by: ICML-2023

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/94110688939

Machine learning models frequently experience performance drops under distribution shifts. The underlying cause of such shifts may be multiple simultaneous factors such as changes in data quality, differences in specific covariate distributions, or changes in the relationship between label and features. When a model does fail during deployment, attributing performance change to these factors is critical for the model developer to identify the root cause and take mitigating actions. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
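
Since the attribution reduces to Shapley values over sets of candidate distributions, a brute-force sketch helps fix ideas. The value function below is a toy stand-in; how one would estimate it in practice (e.g., via the paper's importance-weighting method) is left abstract.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a small set of players (here, the individual
    distributions whose shift may explain a performance change). value(S)
    returns the performance change when only the distributions in S shift."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi

# Toy value function: the covariate shift explains most of the performance drop.
drop = {"covariates": -4.0, "label|covariates": -1.0}
value = lambda S: sum(drop[p] for p in S)
print(shapley_values(list(drop), value))
```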

2023-08-15

Concept Bottleneck Models

Accepted by: ICML-2020

Presenter: Kien Nguyen

Time: 4:00-5:30 p.m. EDT

Slides: link

We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction. On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models, while enabling interpretation in terms of high-level clinical concepts ("bone spurs") or bird attributes ("wing color"). These models also allow for richer human-model interaction: accuracy improves significantly if we can correct model mistakes on concepts at test time.
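
A minimal sketch of the bottleneck idea, with toy dimensions and module names of our own choosing: the label is predicted only from the predicted concepts, so concepts can be edited at test time.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Toy concept-bottleneck model: input -> concept predictions -> label."""
    def __init__(self, in_dim, n_concepts, n_classes):
        super().__init__()
        self.concept_net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                         nn.Linear(64, n_concepts))
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x, concept_override=None):
        concepts = torch.sigmoid(self.concept_net(x))
        if concept_override is not None:
            concepts = concept_override  # test-time intervention on concepts
        return self.label_net(concepts), concepts

model = ConceptBottleneck(in_dim=32, n_concepts=5, n_classes=3)
x = torch.randn(4, 32)
logits, concepts = model(x)
# Intervene: force the first concept (e.g., "bone spur present") to 1 and re-predict.
fixed = concepts.detach().clone()
fixed[:, 0] = 1.0
logits_fixed, _ = model(x, concept_override=fixed)
```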

2023-05-23

Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

Accepted by: AAAI-2023

Presenter: Ziyang Jia

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Code: https://github.com/sunanhe/MKT

Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge by a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, while ignoring the rich semantic information inherent in image-text pairs. Instead, recently developed open-vocabulary (OV) based methods succeed in exploiting such information of image-text pairs in object detection, and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets.

2023-04-25

Distributionally Robust Post-hoc Classifiers under Prior Shifts

Accepted by: ICLR-2023

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributionally robust post-hoc classifiers.

2023-04-18

Revisiting the Calibration of Modern Neural Networks

Accepted by: NeurIPS-2021

Presenter: Qitong Wang

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.
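
For readers less familiar with calibration metrics, the standard expected calibration error (ECE) used in this line of work can be computed as follows; this sketch is generic, not the paper's exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence and average the |accuracy - confidence|
    gap per bin, weighted by the fraction of samples in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: a roughly well-calibrated predictor has a small ECE.
conf = np.random.uniform(0.5, 1.0, size=1000)
correct = (np.random.uniform(size=1000) < conf).astype(float)
print(expected_calibration_error(conf, correct))
```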

2023-04-12

Deep Model Reassembly

Accepted by: NeurIPS-2022

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Code: https://github.com/Adamdad/DeRy

In this paper, we explore a novel knowledge-transfer task, termed Deep Model Reassembly (DeRy), for general-purpose model reuse. Given a collection of heterogeneous models pre-trained from distinct sources and with diverse architectures, the goal of DeRy, as its name implies, is to first dissect each model into distinctive building blocks, and then selectively reassemble the derived blocks to produce customized networks under both hardware resource and performance constraints. The ambitious nature of DeRy inevitably imposes significant challenges, including, in the first place, the feasibility of its solution. We strive to show that, through a dedicated paradigm proposed in this paper, DeRy can be made not only possible but practically efficient. Specifically, we conduct the partitions of all pre-trained networks jointly via a cover set optimization, and derive a number of equivalence sets, within each of which the network blocks are treated as functionally equivalent and hence interchangeable. The equivalence sets learned in this way, in turn, enable picking and assembling blocks to customize networks subject to certain constraints, which is achieved via solving an integer program backed up with a training-free proxy to estimate the task performance. The reassembled models give rise to gratifying performance while satisfying the user-specified constraints. We demonstrate that on ImageNet, the best reassembled model achieves 78.6% top-1 accuracy without fine-tuning, which can be further elevated to 83.2% with end-to-end training.

2023-04-04

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Accepted by: ICLR-2023

Presenter: Tang Li

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Project Page: https://socraticmodels.github.io/

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

2023-03-21

Delaunay Component Analysis for Evaluation of Data Representations

Accepted by: ICLR-2022

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/99236256016

Advanced representation learning techniques require reliable and general evaluation methods. Recently, several algorithms based on the common idea of geometric and topological analysis of a manifold approximated from the learned data representations have been proposed. In this work, we introduce Delaunay Component Analysis (DCA) - an evaluation algorithm which approximates the data manifold using a more suitable neighbourhood graph called the Delaunay graph. This provides a reliable manifold estimation even for challenging geometric arrangements of representations such as clusters with varying shape and density as well as outliers, which is where existing methods often fail. Furthermore, we exploit the nature of Delaunay graphs and introduce a framework for assessing the quality of individual novel data representations. We experimentally validate the proposed DCA method on representations obtained from neural networks trained with a contrastive objective, as well as from supervised and generative models, and demonstrate various use cases of our extended single-point evaluation framework.

2023-03-07

ReAct: Synergizing Reasoning and Acting in Language Models

Accepted by: ICLR-2023

Presenter: Amani Arman Kiruga

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Project Page: https://react-lm.github.io/

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.
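
A hedged sketch of the interleaved thought/action/observation loop; the prompt format, the parser, and the scripted "LLM" and search tool below are hypothetical stand-ins, not the authors' prompts or API.

```python
def parse_action(text):
    """Extract 'Name[argument]' from an 'Action: Name[argument]' line."""
    action = text.split("Action:", 1)[1].strip()
    name, arg = action.split("[", 1)
    return name.strip(), arg.rstrip("]").strip()

def react_loop(question, llm, tools, max_steps=5):
    """ReAct-style loop sketch: the model alternates a free-text Thought and a
    structured Action; actions are run by external tools whose results are
    appended back to the transcript as an Observation."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                      # one Thought + Action block
        transcript += step + "\n"
        name, arg = parse_action(step)
        if name == "Finish":                        # e.g. Action: Finish[answer]
            return arg
        transcript += f"Observation: {tools[name](arg)}\n"
    return None

# Dummy stand-ins: a scripted "LLM" and a toy lookup tool.
script = iter(["Thought: I should look this up.\nAction: Search[Apollo 11]",
               "Thought: I have the answer.\nAction: Finish[1969]"])
answer = react_loop("When did Apollo 11 land?",
                    llm=lambda prompt: next(script),
                    tools={"Search": lambda q: "Apollo 11 landed in 1969."})
print(answer)  # -> 1969
```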

2023-03-01

Leveraging Domain Relations for Domain Generalization

Submitted to: arXiv-2023

Presenter: Fengchun Qiao

Time: 4:30-6:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Distribution shift is a major challenge in machine learning, as models often perform poorly during the test stage if the test distribution differs from the training distribution. In this paper, we focus on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on, and propose a new approach called D^3G. Unlike previous approaches that aim to learn a single model that is domain invariant, D^3G learns domain-specific models by leveraging the relations among different domains. Concretely, D^3G learns a set of training-domain-specific functions during the training stage and reweights them based on domain relations during the test stage. These domain relations can be directly derived or learned from fixed domain meta-data. Under mild assumptions, we theoretically prove that using domain relations to reweight training-domain-specific functions achieves stronger generalization than averaging them. Empirically, we evaluate the effectiveness of D^3G using both toy and real-world datasets for tasks such as temperature regression, land use classification, and molecule-protein interaction prediction. Our results show that D^3G consistently outperforms state-of-the-art methods, with an average improvement of 10.6% in performance.

2023-02-21

Agree to Disagree: Diversity through Disagreement for Better Transferability

Accepted by: ICLR-2023

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features (present in the training data but absent from the test data) and (ii) only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution, referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to assess a priori whether a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, as well as demonstrate in multiple experiments how the proposed method can mitigate shortcut learning, enhance uncertainty and OOD detection, and improve transferability.
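
To make the agreement/disagreement idea concrete, here is a simplified stand-in objective for two models; the particular disagreement penalty (softmax overlap on unlabeled OOD inputs) is our own choice, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dbat_objective(logits1, logits2, labels, logits1_ood, logits2_ood, lam=1.0):
    """D-BAT-style sketch: both models fit the labeled training data, while on
    unlabeled OOD inputs their predictive distributions are pushed apart by
    penalizing the overlap (inner product) of their softmax outputs."""
    fit = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    p1 = F.softmax(logits1_ood, dim=-1)
    p2 = F.softmax(logits2_ood, dim=-1)
    agreement = (p1 * p2).sum(dim=-1).mean()  # high when the two models agree
    return fit + lam * agreement

# Toy usage with random logits for 3 classes.
labels = torch.randint(0, 3, (16,))
loss = dbat_objective(torch.randn(16, 3), torch.randn(16, 3), labels,
                      torch.randn(8, 3), torch.randn(8, 3))
print(float(loss))
```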

2023-02-14

Diagnosing and Rectifying Vision Models using Language

Accepted by: ICLR-2023

Presenter: Tang Li

Time: 4:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99236256016

Slides: link

Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.

2023-02-07

MBW: Multi-view Bootstrapping in the Wild

Accepted by: NeurIPS-2022

Presenter: Ziyang Jia

Time: 4:00-5:00 p.m. EST

Code: https://github.com/mosamdabhi/MBW

Zoom: https://udel.zoom.us/j/99236256016

Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled. The approach, however, is based on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios. In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras. With just a few annotations (representing 1-2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach demonstrates impressive results on standard human datasets, as well as tigers, cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos captured casually in a zoo. We release the codebase for MBW as well as this challenging zoo dataset, consisting of image frames of tail-end distribution categories with their corresponding 2D and 3D labels generated from minimal human intervention.

2023-01-17

Conformal Time-Series Forecasting

Accepted by: NeurIPS-2021

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/99637309310

Slides: link

Current approaches for multi-horizon time series forecasting using recurrent neural networks (RNNs) focus on issuing point estimates, which is insufficient for decision-making in critical application domains where an uncertainty estimate is also required. Existing approaches for uncertainty quantification in RNN-based time-series forecasts are limited as they may require significant alterations to the underlying model architecture, may be computationally complex, may be difficult to calibrate, may incur high sample complexity, and may not provide theoretical guarantees on frequentist coverage. In this paper, we extend the inductive conformal prediction framework to the time-series forecasting setup, and propose a lightweight algorithm to address all of the above limitations, providing uncertainty estimates with theoretical guarantees for any multi-horizon forecast predictor and any dataset with minimal exchangeability assumptions. We demonstrate the effectiveness of our approach by comparing it with existing benchmarks on a variety of synthetic and real-world datasets.

2022-12-20

Survey of Diffusion Models

Presenter: Qitong Wang

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/94085069142

Slides: link

Related Paper 1: Accepted by NeurIPS, 2022.

Related Paper 2: Submitted to arXiv, 2022.

Related Paper 3: Accepted by CVPR, 2022.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
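
The fixed forward (noising) process mentioned above has a closed form, sketched below with a common linear noise schedule; the reverse, learned denoising model is omitted, and the variable names are our own.

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

betas = np.linspace(1e-4, 0.02, 1000)           # a common linear noise schedule
x0 = np.random.randn(8)                         # stand-in for a data sample
print(forward_diffuse(x0, t=999, betas=betas))  # nearly pure noise at the last step
```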

2022-12-13

Probable Domain Generalization via Quantile Risk Minimization

Accepted by: NeurIPS-2022

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/94085069142

Slides: link

Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the α-quantile of a predictor's risk distribution over domains, QRM seeks predictors that perform well with probability α. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a generalization bound for EQRM; and (ii) that EQRM recovers the causal predictor as α→1. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG, and demonstrate that EQRM outperforms state-of-the-art baselines on CMNIST and several datasets from WILDS and DomainBed.
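
A minimal sketch of the quantile objective over per-domain risks; this uses a plain empirical quantile rather than the paper's EQRM estimator, and the variable names are ours.

```python
import torch

def quantile_risk(per_domain_losses, alpha=0.9):
    """QRM-style objective sketch: instead of the mean or the max over domains,
    take the empirical alpha-quantile of the per-domain risks, asking the
    predictor to do well on at least an alpha fraction of domains."""
    losses = torch.stack(per_domain_losses)
    sorted_losses, _ = torch.sort(losses)
    idx = min(int(alpha * len(sorted_losses)), len(sorted_losses) - 1)
    return sorted_losses[idx]

# Toy usage: three domains with different empirical risks.
losses = [torch.tensor(0.2, requires_grad=True),
          torch.tensor(0.5, requires_grad=True),
          torch.tensor(1.3, requires_grad=True)]
quantile_risk(losses, alpha=0.9).backward()  # gradient flows to the quantile domain
```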

2022-11-29

Localizing Visual Sounds the Hard Way

Accepted by: CVPR-2021

Presenter: Amani Arman Kiruga

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/94085069142

Slides: link

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.

2022-11-15

A Brief Survey on Explainable AI

Presenter: Qiren Wang

Time: 3:30-5:00 p.m. EST

Zoom: https://udel.zoom.us/j/96357119276

Machine learning has been developing for many years and has radically changed the world, especially over the past decade. Machine learning models now achieve high accuracy when asked to make predictions. However, the results of these high-accuracy models are hard to interpret, even though people want to know how a result was reached before making an important decision. This is why explainable AI has come into researchers' sight.

2022-11-01

On the Strong Correlation Between Model Invariance and Generalization

Accepted by: NeurIPS-2022

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96357119276

Slides: link

Generalization and invariance are two essential properties of machine learning models. Generalization captures a model’s ability to classify unseen data while invariance measures consistency of model predictions on transformations of the data. Existing research suggests a positive relationship: a model generalizing well should be invariant to certain visual factors. Building on this qualitative implication, we make two contributions. First, we introduce effective invariance (EI), a simple and reasonable measure of model invariance which does not rely on image labels. Given predictions on a test image and its transformed version, EI measures how well the predictions agree and with what level of confidence. Second, using invariance scores computed by EI, we perform large-scale quantitative correlation studies between generalization and invariance, focusing on rotation and grayscale transformations. From a model-centric view, we observe that generalization and invariance of different models exhibit a strong linear relationship, on both in-distribution and out-of-distribution datasets. From a dataset-centric view, we find that a given model’s accuracy and invariance are linearly correlated across different test sets. Apart from these major findings, other minor but interesting insights are also discussed.
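
Based on the description above, one plausible instantiation of an EI-style, label-free invariance score is sketched below; the exact formula may differ from the paper's.

```python
import numpy as np

def effective_invariance(probs_orig, probs_trans):
    """EI-style sketch (our reading of the abstract): a prediction pair scores
    high only when the original and transformed images receive the same predicted
    class, weighted by the confidence of both predictions. No labels are needed."""
    scores = []
    for p, q in zip(probs_orig, probs_trans):
        same = np.argmax(p) == np.argmax(q)
        scores.append(np.sqrt(p.max() * q.max()) if same else 0.0)
    return float(np.mean(scores))

p = np.array([[0.9, 0.1], [0.6, 0.4]])
q = np.array([[0.8, 0.2], [0.3, 0.7]])  # second prediction flips under the transform
print(effective_invariance(p, q))
```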

2022-10-18

Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation

Accepted by: CVPR-2020

Presenter: Tang Li

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96357119276

Slides: link

Image-level weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years. Most advanced solutions exploit class activation maps (CAMs). However, CAMs can hardly serve as the object mask due to the gap between full and weak supervisions. In this paper, we propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap. Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation, whose pixel-level labels take the same spatial transformation as the input images during data augmentation. However, this constraint is lost on the CAMs trained by image-level supervision. Therefore, we propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning. Moreover, we propose a pixel correlation module (PCM), which exploits context appearance information and refines the prediction of the current pixel using its similar neighbors, leading to further improvement in CAM consistency. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate our method outperforms state-of-the-art methods using the same level of supervision. The code is released online.
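
A toy sketch of the equivariance consistency term: the CAM of a transformed image should match the transformed CAM of the original image. The `cam_model` below is a hypothetical placeholder, and horizontal flip stands in for the augmentations used in practice.

```python
import torch

def equivariance_consistency(cam_model, images):
    """SEAM-style consistency sketch: penalize the gap between the CAM of a
    flipped image and the flipped CAM of the original image."""
    flipped = torch.flip(images, dims=[-1])  # horizontal flip as the transform
    cam_orig = cam_model(images)
    cam_flip = cam_model(flipped)
    return (cam_flip - torch.flip(cam_orig, dims=[-1])).abs().mean()

# Dummy stand-in "CAM network": a random conv layer as a placeholder.
conv = torch.nn.Conv2d(3, 21, kernel_size=3, padding=1)
loss = equivariance_consistency(lambda x: conv(x), torch.randn(2, 3, 64, 64))
print(float(loss))
```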

2022-10-11

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96357119276

2022-10-04

Conditional Prompt Learning for Vision-Language Models

Accepted by: CVPR-2022

Presenter: Qitong Wang

Time: 3:30-5:00 p.m. EDT

Code: https://github.com/KaiyangZhou/CoOp

Zoom: https://udel.zoom.us/j/96357119276

Slides: link

Related Paper 1: "CLIP", ICML, 2021.

Related Paper 2: "CoOp", IJCV, 2022.

With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well.
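
A hedged sketch of the instance-conditional prompt idea with toy dimensions; module names and sizes are our assumptions, not the released CoCoOp code.

```python
import torch
import torch.nn as nn

class ConditionalContext(nn.Module):
    """CoCoOp-style sketch: learnable context vectors shared across classes, plus
    a small meta-net that maps each image feature to a token added to every
    context vector, making the prompt input-conditional."""
    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable prompt context
        self.meta_net = nn.Sequential(nn.Linear(dim, dim // 16), nn.ReLU(),
                                      nn.Linear(dim // 16, dim))

    def forward(self, image_features):          # image_features: (batch, dim)
        bias = self.meta_net(image_features)     # one conditioning token per image
        # (batch, n_ctx, dim): shared context shifted by the image-conditional bias
        return self.ctx.unsqueeze(0) + bias.unsqueeze(1)

prompt_ctx = ConditionalContext()
ctx = prompt_ctx(torch.randn(8, 512))  # would be concatenated with class-name tokens
print(ctx.shape)                       # torch.Size([8, 4, 512])
```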

2022-09-27

VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection

Accepted by: NeurIPS-2021

Presenter: Amani Arman Kiruga

Time: 3:30-5:00 p.m. EDT

Code: https://github.com/DASH-Lab/VFP290K

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

Detection of fallen persons due to, for example, health problems, violence, or accidents, is a critical challenge. Accordingly, detection of these anomalous events is of paramount importance for a number of applications, including but not limited to CCTV surveillance, security, and health care. Given that many detection systems rely on comprehensive training data, a dataset comprising fallen person images collected in diverse environments and situations is crucial. However, existing datasets are limited to specific environmental conditions and lack diversity. To address the above challenges and help researchers develop more robust detection systems, we create a novel, large-scale dataset for the detection of fallen persons composed of fallen person images collected in various real-world scenarios, with the support of the South Korean government. Our Vision-based Fallen Person (VFP290K) dataset consists of 294,713 frames of fallen persons extracted from 178 videos, including 131 scenes in 49 locations. We empirically demonstrate the effectiveness of the features through extensive experiments analyzing the performance shift based on object detection models. In addition, we evaluate properly divided versions of our VFP290K dataset by measuring the performance of fallen person detection systems. We ranked first in the first round of the anomalous behavior recognition track of AI Grand Challenge 2020, South Korea, using our VFP290K dataset, which can be found here. Our achievement implies the usefulness of our dataset for research on fallen person detection, which can further extend to other applications, such as intelligent CCTV or monitoring systems. The data and more up-to-date information are provided at our VFP290K site.

2022-09-20

General Multi-label Image Classification with Transformers

Accepted by: CVPR-2021

Presenter: Ziyang Jia

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels, and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the uncertainty of labels during training, it is more general by allowing us to produce improved results for images with partial or extra label annotations during inference. We demonstrate this additional capability in the COCO, Visual Genome, News500, and CUB image datasets.

2022-09-13

Generalizing to Evolving Domains with Latent Structure-Aware Sequential Autoencoder

Accepted by: ICML-2022

Presenter: Kien Nguyen

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Domain generalization aims to improve the generalization capability of machine learning systems to out-of-distribution (OOD) data. Existing domain generalization techniques assume stationary and discrete environments to tackle the generalization issue caused by OOD data. However, many real-world tasks in non-stationary environments (e.g., self-driving car systems, sensor measurements) involve more complex and continuously evolving domain drift, which raises new challenges for domain generalization. In this paper, we formulate the aforementioned setting as the problem of evolving domain generalization. Specifically, we propose a probabilistic framework called Latent Structure-aware Sequential Autoencoder (LSSAE) to tackle evolving domain generalization by exploring the underlying continuous structure in the latent space of deep neural networks, where we aim to identify two major factors, namely covariate shift and concept shift, that account for distribution shift in non-stationary environments. Experimental results on both synthetic and real-world datasets show that LSSAE achieves superior performance in the evolving domain generalization setting.

2022-09-06

Interpretations are useful: penalizing explanations to align neural networks with prior knowledge

Accepted by: ICML-2020

Presenter: Tang Li

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods in order to increase the predictive accuracy of deep learning models. In particular, when shown that a model has incorrectly assigned importance to some features, CDEP enables practitioners to correct these errors by directly regularizing the provided explanations. Using explanations provided by contextual decomposition (CD) (Murdoch et al., 2018), we demonstrate the ability of our method to increase performance on an array of toy and real datasets.
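
A hedged sketch of explanation penalization: the total loss adds a term that punishes importance assigned to regions a prior marks as irrelevant. CDEP itself uses contextual decomposition; here a plain input-gradient saliency stands in for the attribution method, and irrelevant_mask is a hypothetical per-pixel prior.

    import torch
    import torch.nn.functional as F

    def explanation_penalized_loss(model, x, y, irrelevant_mask, lam=1.0):
        # Sketch: classification loss + penalty on attributions over irrelevant regions.
        x = x.clone().requires_grad_(True)
        logits = model(x)
        task_loss = F.cross_entropy(logits, y)
        score = logits.gather(1, y.unsqueeze(1)).sum()
        # input-gradient saliency as a simple stand-in for contextual decomposition
        saliency = torch.autograd.grad(score, x, create_graph=True)[0].abs()
        expl_loss = (saliency * irrelevant_mask).mean()
        return task_loss + lam * expl_loss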

2022-08-30

Self-Supervised Learning Disentangled Group Representation as Feature

Accepted by: NeurIPS-2021

Presenter: Fengchun Qiao

Time: 3:30-5:00 p.m. EDT

Code: https://github.com/Wangt-CN/IP-IRM

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of "good" representation from a group-theoretic view using Higgins' definition of disentangled representation, and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, thus unable to modularize the remaining semantics. To break the limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them into concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees to disentangle the group element. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks.

2022-08-23

Federated Multi-Task Learning under a Mixture of Distributions

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 3:30-5:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.

2022-08-16

Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos

Accepted by: CVPR-2022

Presenter: Qitong Wang

Time: 12:30-2:00 p.m. EDT

Project Page: https://stevenlsw.github.io/hoi-forecast/

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

We propose to forecast future hand-object interactions given an egocentric video. Instead of predicting action labels or pixels, we directly predict the hand motion trajectory and the future contact points on the next active object (i.e., interaction hotspots). This relatively low-dimensional representation provides a concrete description of future interactions. To tackle this task, we first provide an automatic way to collect trajectory and hotspots labels on large-scale data. We then use this data to train an Object-Centric Transformer (OCT) model for prediction. Our model performs hand and object interaction reasoning via the self-attention mechanism in Transformers. OCT also provides a probabilistic framework to sample the future trajectory and hotspots to handle uncertainty in prediction. We perform experiments on the Epic-Kitchens-55, Epic-Kitchens-100, and EGTEA Gaze+ datasets, and show that OCT significantly outperforms state-of-the-art approaches by a large margin.

2022-08-10

Detecting Moments and Highlights in Videos via Natural Language Queries

Accepted by: NeurIPS-2021

Presenter: Amani Arman Kiruga

Time: 1:00-2:00 p.m. EDT

Data & Code: https://github.com/jayleicn/moment_detr

Zoom: https://udel.zoom.us/j/96016065684

Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR.

2022-08-03

A Brief Review on Transformer-Based Video Captioning Tasks

Presenter: Ziyang Jia

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/96016065684

Slides: link

As a connection between the two worlds of vision (CV) and language (NLP), video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. The task is naturally decomposed into two sub-tasks. One is to encode a video via a thorough understanding and learn visual representation. The other is caption generation, which decodes the learned representation into a sequential sentence, word by word.

2022-07-06 & 27

Contrastive Test-Time Adaptation

Accepted by: CVPR-2022

Presenter: Kien Nguyen

Time: 1:00-2:00 p.m. EDT

Project Page: https://sites.google.com/view/adacontrast

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

Test-time adaptation is a special setting of unsupervised domain adaptation where a trained model on the source domain has to adapt to the target domain without accessing source data. We propose a novel way to leverage self-supervised contrastive learning to facilitate target feature learning, along with an online pseudo labeling scheme with refinement that significantly denoises pseudo labels. The contrastive learning task is applied jointly with pseudo labeling, contrasting positive and negative pairs constructed similarly as MoCo but with source-initialized encoder, and excluding same-class negative pairs indicated by pseudo labels. Meanwhile, we produce pseudo labels online and refine them via soft voting among their nearest neighbors in the target feature space, enabled by maintaining a memory queue. Our method, AdaContrast, achieves state-of-the-art performance on major benchmarks while having several desirable properties compared to existing works, including memory efficiency, insensitivity to hyper-parameters, and better model calibration.
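
A hedged sketch of the nearest-neighbor soft-voting step: pseudo labels for target features are refined by averaging the probabilities of their nearest neighbors in a memory queue. Tensor shapes and the cosine-similarity retrieval are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def refine_pseudo_labels(feats, memory_feats, memory_probs, k=5):
        # feats: (B, D) target features; memory_*: queue of past features/probabilities.
        feats = F.normalize(feats, dim=1)
        memory_feats = F.normalize(memory_feats, dim=1)
        sim = feats @ memory_feats.t()             # (B, M) cosine similarities
        _, idx = sim.topk(k, dim=1)                # k nearest neighbors in the queue
        refined = memory_probs[idx].mean(dim=1)    # soft vote over neighbors' probabilities
        return refined.argmax(dim=1), refined      # refined pseudo labels and soft scores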

2022-06-29

WILDS: A Benchmark of in-the-Wild Distribution Shifts

Accepted by: ICML-2021

Presenter: Tang Li

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations.

2022-06-15

Graph-Based Continual Learning

Accepted by: ICLR-2021

Presenter: Fengchun Qiao

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/92841583823

Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.

2022-06-08

Exponential Graph is Provably Efficient for Decentralized Deep Training

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 1:00-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

Decentralized SGD is an emerging training method for deep learning, known for requiring much less (and thus faster) communication per iteration, which it achieves by relaxing the averaging step in parallel SGD to inexact averaging. The less exact the averaging, however, the more total iterations the training needs. Therefore, the key to making decentralized SGD efficient is to realize nearly exact averaging using little communication. This requires a skillful choice of communication topology, which is an under-studied topic in decentralized optimization. In this paper, we study so-called exponential graphs, where every node is connected to O(log(n)) neighbors and n is the total number of nodes. This work proves that such graphs can lead to both fast communication and effective averaging simultaneously. We also discover that a sequence of log(n) one-peer exponential graphs, in which each node communicates with a single neighbor per iteration, can together achieve exact averaging. This favorable property enables the one-peer exponential graph to average as effectively as its static counterpart while communicating more efficiently. We apply these exponential graphs in decentralized (momentum) SGD to obtain the state-of-the-art balance between per-iteration communication and iteration complexity among all commonly used topologies. Experimental results on a variety of tasks and models demonstrate that decentralized (momentum) SGD over exponential graphs promises both fast and high-quality training.
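
A small sketch of the one-peer schedule described above, under the common construction where at iteration t each node i sends to (i + 2^(t mod tau)) mod n with tau = ceil(log2 n); the helper name is hypothetical and details may differ from the paper.

    import math

    def one_peer_exponential_schedule(n, iterations):
        # At each iteration every node talks to exactly one neighbor; the hop
        # distance cycles through powers of two (1, 2, 4, ...).
        tau = max(1, math.ceil(math.log2(n)))
        schedule = []
        for t in range(iterations):
            hop = 2 ** (t % tau)
            schedule.append({i: (i + hop) % n for i in range(n)})
        return schedule

    # Example with 8 nodes and 3 rounds: hop distances 1, 2, 4.
    print(one_peer_exponential_schedule(8, 3))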

2022-06-01

Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Accepted by: ECCV-2022

Presenter: Qitong Wang

Time: 1:00-2:00 p.m. EDT

Project Page: https://yuliang.vision/InstCal/

Zoom: https://udel.zoom.us/j/92841583823

Slides: link

We propose a test-time adaptation method for cross-domain image segmentation. Our method is simple: Given a new unseen instance at test time, we adapt a pre-trained model by conducting instance-specific BatchNorm (statistics) calibration. Our approach has two core components. First, we replace the manually designed BatchNorm calibration rule with a learnable module. Second, we leverage strong data augmentation to simulate random domain shifts for learning the calibration rule. In contrast to existing domain adaptation methods, our method does not require accessing the target domain data at training time or conducting computationally expensive test-time model training/optimization. Equipping our method with models trained by standard recipes achieves significant improvement, comparing favorably with several state-of-the-art domain generalization and one-shot unsupervised domain adaptation approaches. Combining our method with the domain generalization methods further improves performance, reaching a new state of the art.
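
A hedged sketch of instance-specific BatchNorm calibration: at test time, the stored source statistics of a BatchNorm layer are blended with the current instance's own statistics through a learnable coefficient. This simplifies the paper's learnable calibration module to a single scalar per layer.

    import torch
    import torch.nn as nn

    class InstanceCalibratedBN2d(nn.Module):
        # Sketch: wrap a trained BatchNorm2d and mix its running statistics with
        # the statistics of the incoming (test) instance.
        def __init__(self, bn: nn.BatchNorm2d):
            super().__init__()
            self.bn = bn
            self.alpha = nn.Parameter(torch.tensor(2.0))  # learned mixing logit

        def forward(self, x):
            a = torch.sigmoid(self.alpha)                 # weight on source statistics
            inst_mean = x.mean(dim=(0, 2, 3))
            inst_var = x.var(dim=(0, 2, 3), unbiased=False)
            mean = a * self.bn.running_mean + (1 - a) * inst_mean
            var = a * self.bn.running_var + (1 - a) * inst_var
            x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.bn.eps)
            return x_hat * self.bn.weight[None, :, None, None] + self.bn.bias[None, :, None, None]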

2022-05-25

Do Feature Attribution Methods Correctly Attribute Features?

Accepted by: AAAI-2022

Presenter: Ziyang Jia

Time: 1:30-3:00 p.m. EDT

Zoom: https://udel.zoom.us/j/94384898933

Feature attribution methods are popular in interpretable machine learning. These methods compute the attribution of each input feature to represent its importance, but there is no consensus on the definition of "attribution", leading to many competing methods with little systematic evaluation, complicated in particular by the lack of ground truth attribution. To address this, we propose a dataset modification procedure to induce such ground truth. Using this procedure, we evaluate three common methods: saliency maps, rationales, and attentions. We identify several deficiencies and add new perspectives to the growing body of evidence questioning the correctness and reliability of these methods applied on datasets in the wild. We further discuss possible avenues for remedy and recommend new attribution methods to be tested against ground truth before deployment.

2022-04-26

Explainable Deep Classification Models for Domain Generalization

Accepted by: CVPR-W-2021

Presenter: Tang Li

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Conventionally, AI models are thought to trade off explainability for lower accuracy. We develop a training strategy that not only leads to a more explainable AI system for object classification, but as a consequence, suffers no perceptible accuracy degradation. Explanations are defined as regions of visual evidence upon which a deep classification network makes a decision. This is represented in the form of a saliency map conveying how much each pixel contributed to the network's decision. Our training strategy enforces a periodic saliency-based feedback to encourage the model to focus on the image regions that directly correspond to the ground-truth object. We quantify explainability using an automated metric, and using human judgement. We propose explainability as a means for bridging the visual-semantic gap between different domains where model explanations are used as a means of disentangling domain specific information from otherwise relevant features. We demonstrate that this leads to improved generalization to new domains without hindering performance on the original domain.

2022-04-19

Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition

Accepted by: ICCV-2021

Presenter: Kien Nguyen

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Human pose is a useful feature for fine-grained sports action understanding. However, pose estimators are often unreliable when run on sports video due to domain shift and factors such as motion blur and occlusions. This leads to poor accuracy when downstream tasks, such as action recognition, depend on pose. End-to-end learning circumvents pose, but requires more labels to generalize. We introduce Video Pose Distillation (VPD), a weakly-supervised technique to learn features for new video domains, such as individual sports that challenge pose estimation. Under VPD, a student network learns to extract robust pose features from RGB frames in the sports video, such that, whenever pose is considered reliable, the features match the output of a pretrained teacher pose detector. Our strategy retains the best of both pose and end-to-end worlds, exploiting the rich visual patterns in raw video frames, while learning features that agree with the athletes' pose and motion in the target video domain to avoid over-fitting to patterns unrelated to athletes' motion. VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in four real-world sports video datasets, without requiring additional ground-truth pose annotations.

2022-04-12

Environment Inference for Invariant Learning

Accepted by: ICML-2021

Presenter: Fengchun Qiao

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Learning models that gracefully handle distribution shifts is central to research on domain generalization, robust optimization, and fairness. A promising formulation is domain-invariant learning, which identifies the key issue of learning which features are domain-specific versus domain-invariant. An important assumption in this area is that the training examples are partitioned into "domains" or "environments". Our focus is on the more common setting where such partitions are not provided. We propose EIIL, a general framework for domain-invariant learning that incorporates Environment Inference to directly infer partitions that are maximally informative for downstream Invariant Learning. We show that EIIL outperforms invariant learning methods on the CMNIST benchmark without using environment labels, and significantly outperforms ERM on worst-group performance in the Waterbirds and CivilComments datasets. Finally, we establish connections between EIIL and algorithmic fairness, which enables EIIL to improve accuracy and calibration in a fair prediction problem.
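
A hedged sketch of the idea: the IRMv1 penalty measures how far a reference classifier is from being simultaneously optimal across environments, and environment inference learns soft per-sample assignments that maximize that penalty. The optimization loop, binary split, and cross-entropy risk are simplifying assumptions, not the paper's exact procedure.

    import torch
    import torch.nn.functional as F

    def irm_penalty(logits, y):
        # IRMv1 penalty: squared gradient of the risk w.r.t. a dummy classifier scale.
        scale = torch.ones(1, device=logits.device, requires_grad=True)
        loss = F.cross_entropy(logits * scale, y)
        grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
        return (grad ** 2).sum()

    def infer_environments(logits, y, steps=500, lr=0.01):
        # Learn soft assignments to two environments that maximize the summed
        # per-environment IRM penalty of a fixed reference model (sketch).
        logits = logits.detach()
        w = torch.zeros(len(y), device=logits.device, requires_grad=True)
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            q = torch.sigmoid(w)
            scale = torch.ones(1, device=logits.device, requires_grad=True)
            per_sample = F.cross_entropy(logits * scale, y, reduction="none")
            penalty = 0.0
            for weight in (q, 1 - q):
                risk = (weight * per_sample).mean()
                g = torch.autograd.grad(risk, [scale], create_graph=True)[0]
                penalty = penalty + (g ** 2).sum()
            opt.zero_grad()
            (-penalty).backward()      # gradient ascent on the penalty
            opt.step()
        return torch.sigmoid(w).detach()   # soft environment assignment per sample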

2022-04-05

Learning with Noisy Correspondence for Cross-modal Matching

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Cross-modal matching, which aims to establish the correspondence between two different modalities, is fundamental to a variety of tasks such as cross-modal retrieval and vision-and-language understanding. Although a huge number of cross-modal matching methods have been proposed and have achieved remarkable progress in recent years, almost all of these methods implicitly assume that the multimodal training data are correctly aligned. In practice, however, such an assumption is extremely expensive or even impossible to satisfy. Based on this observation, we reveal and study a latent and challenging direction in cross-modal matching, named noisy correspondence, which can be regarded as a new paradigm of noisy labels. Different from traditional noisy labels, which mainly refer to errors in category labels, noisy correspondence refers to mismatched paired samples. To solve this new problem, we propose a novel method for learning with noisy correspondence, named Noisy Correspondence Rectifier (NCR). In brief, NCR divides the data into clean and noisy partitions based on the memorization effect of neural networks and then rectifies the correspondence via an adaptive prediction model in a co-teaching manner. To verify the effectiveness of our method, we conduct experiments using image-text matching as a showcase. Extensive experiments on Flickr30K, MS-COCO, and Conceptual Captions verify the effectiveness of our method. The code can be accessed at www.pengxi.me.
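
A hedged sketch of the memorization-based split used by co-teaching-style methods: fit a two-component Gaussian mixture to per-sample losses and treat samples likely to come from the low-loss component as cleanly paired. The normalization and threshold here are illustrative choices.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def split_clean_noisy(per_sample_losses, threshold=0.5):
        # Low-loss component ~ clean pairs; high-loss component ~ noisy correspondence.
        losses = np.asarray(per_sample_losses, dtype=np.float64).reshape(-1, 1)
        losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
        gmm = GaussianMixture(n_components=2, reg_covar=1e-4).fit(losses)
        clean_component = int(np.argmin(gmm.means_.ravel()))
        p_clean = gmm.predict_proba(losses)[:, clean_component]
        return p_clean > threshold, p_clean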

2022-03-22

Neighborhood Contrastive Learning for Novel Class Discovery

Accepted by: CVPR-2021

Presenter: Qitong Wang

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).

2022-03-15

Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach

Accepted by: CVPR-2020

Presenter: Ziyang Jia

Time: 12:30-2:00 p.m. EDT

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

We propose to estimate 3D human pose from multi-view images and a few IMUs attached to a person's limbs. The approach operates by first detecting 2D poses from the two signals and then lifting them to 3D space. We present a geometric approach to reinforce the visual features of each pair of joints based on the IMUs. This notably improves 2D pose estimation accuracy, especially when one joint is occluded. We call this approach the Orientation Regularized Network (ORN). We then lift the multi-view 2D poses to 3D space with an Orientation Regularized Pictorial Structure Model (ORPSM), which jointly minimizes the projection error between the 3D and 2D poses, along with the discrepancy between the 3D pose and IMU orientations. This simple two-step approach reduces the error of the state of the art by a large margin on a public dataset.

2022-03-01

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Accepted by: CVPR-2020

Presenter: Amani Arman Kiruga

Time: 12:30-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
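
A hedged sketch of a MIL-NCE-style objective: each video has several candidate narrations, and the loss pools all of them in the numerator of an NCE ratio, so the model does not have to decide which single caption is aligned. Shapes and the binary positive mask are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def mil_nce_loss(video_emb, text_emb, pos_mask, temperature=0.07):
        # video_emb: (B, D), text_emb: (M, D), pos_mask: (B, M) with 1s marking the
        # candidate positive narrations for each video.
        video_emb = F.normalize(video_emb, dim=1)
        text_emb = F.normalize(text_emb, dim=1)
        sim = video_emb @ text_emb.t() / temperature   # (B, M) similarities
        exp_sim = sim.exp()
        pos = (exp_sim * pos_mask).sum(dim=1)          # pooled candidate positives
        denom = exp_sim.sum(dim=1)                     # positives + negatives
        return -(pos / denom).log().mean()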

2022-02-22

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Accepted by: ECCV-2020

Presenter: Nathaniel Merrill

Time: 12:30-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
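
A short sketch of the classic volume rendering step the abstract refers to: given densities and colors sampled along a ray, compute per-sample opacities, accumulated transmittance, and the composited pixel color. This is the standard formulation; the network and ray sampling details are omitted.

    import torch

    def volume_render(sigmas, colors, deltas):
        # sigmas: (R, S) densities, colors: (R, S, 3), deltas: (R, S) sample spacings.
        alphas = 1.0 - torch.exp(-sigmas * deltas)                 # per-sample opacity
        trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)        # accumulated transmittance
        trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
        weights = alphas * trans                                   # contribution of each sample
        rgb = (weights.unsqueeze(-1) * colors).sum(dim=-2)         # composited ray color
        return rgb, weights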

2022-02-15

FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction

Submitted to: arXiv-2021

Presenter: Kien Nguyen

Time: 1:00-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative positions between the cameras. Such a dependency becomes a hurdle once shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end parameter-free multi-view model. FLEX is parameter-free in the sense that it does not require any camera parameters, neither intrinsic nor extrinsic. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on the Human3.6M and KTH Multi-view Football II datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not parameter-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, video examples, and more material will be available on our project page.

2022-02-08

A method to evaluate task-specific importance of spatio-temporal units based on explainable artificial intelligence

Accepted by: International Journal of Geographical Information Science, 2020.

Presenter: Tang Li

Time: 12:30-2:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Big geo-data are often aggregated according to spatio-temporal units for analyzing human activities and urban environments. Many applications categorize such data into groups and compare the characteristics across groups. The intergroup differences vary with spatio-temporal units, and the essential task is to identify the spatio-temporal units with apparently different data characteristics. However, spatio-temporal dependence, data variety, and the complexity of tasks impede an effective unit assessment. Inspired by applications that extract critical image components based on explainable artificial intelligence (XAI), we propose a spatio-temporal layer-wise relevance propagation method to assess spatio-temporal units as a general solution. The method organizes input data into an extensible three-dimensional tensor form. We provide two means of labeling the spatio-temporal tensor data for typical geographical applications, using temporally or spatially relevant information. Neural network training proceeds to extract the global and local characteristics of the data for corresponding analytical tasks. Then the method propagates classification results backward into the units as their task-specific importance. A case study with taxi trajectory data in Beijing validates the method. The results prove that the proposed method can evaluate the task-specific importance of spatio-temporal units with dependence. This study also attempts to discover task-related knowledge using XAI.

2022-02-03

Continual Adaptation of Visual Representations via Domain Randomization and Meta-learning

Accepted by: CVPR-2021

Presenter: Fengchun Qiao

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Most standard learning approaches lead to fragile models which are prone to drift when sequentially trained on samples of a different nature - the well-known "catastrophic forgetting" issue. In particular, when a model consecutively learns from different visual domains, it tends to forget the past domains in favor of the most recent ones. In this context, we show that one way to learn models that are inherently more robust against forgetting is domain randomization - for vision tasks, randomizing the current domain's distribution with heavy image manipulations. Building on this result, we devise a meta-learning strategy where a regularizer explicitly penalizes any loss associated with transferring the model from the current domain to different "auxiliary" meta-domains, while also easing adaptation to them. Such meta-domains are also generated through randomized image manipulations. We empirically demonstrate in a variety of experiments - spanning from classification to semantic segmentation - that our approach results in models that are less prone to catastrophic forgetting when transferred to new domains.

2022-01-27

Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose Estimation

Accepted by: NeurIPS-2021

Presenter: Meng Ma

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

Available 3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision. Barring synthetic or in-studio domains, acquiring such supervision for each new target environment is highly inconvenient. To this end, we cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target. We propose to infer image-to-pose via two explicit mappings viz. image-to-latent and latent-to-pose where the latter is a pre-learned decoder obtained from a prior-enforcing generative adversarial auto-encoder. Next, we introduce relation distillation as a means to align the unpaired cross-modal samples i.e., the unpaired target videos and unpaired 3D pose sequences. To this end, we propose a new set of non-local relations in order to characterize long-range latent pose interactions, unlike general contrastive relations where positive couplings are limited to a local neighborhood structure. Further, we provide an objective way to quantify non-localness in order to select the most effective relation set. We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.

2022-01-20

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

Accepted by: CVPR-2021

Presenter: Qitong Wang

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/97832270671

Slides: link

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.

2021-12-08

A Recent Trend on Contrastive Learning

Presenter: Zhenzhu Zheng

Time: 8:00-9:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Contrastive learning has recently received interest due to its rapid success in self-supervised representation learning. This presentation first reviews recent work in the area of contrastive learning. Although contrastive learning methods show significant progress on large-model training, they do not work well for small models. This brings us to another area known as knowledge distillation. Although most previous studies on knowledge distillation are in supervised settings, this presentation further discusses the recent trend of merging knowledge distillation with contrastive learning. This motivates a key question for the future: how can we learn so much from observation alone? Finally, we discuss possible open problems.

2021-12-01

DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data

Submitted to: arXiv-2020

Presenter: Nathaniel Merrill

Time: 8:00-9:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation, thus preserving accurate shapes of a scene. Previous methods that predict metric depth often work well only for a specific scene. In contrast, learning relative depth (information of being closer or further) can enjoy better generalization, at the price of failing to recover the accurate geometric shape of the scene. In this work, we propose a dataset and methods to tackle this dilemma, aiming to predict accurate depth up to an affine transformation with good generalization to diverse scenes. First, we construct a large-scale and diverse dataset, termed Diverse Scene Depth dataset (DiverseDepth), which has a broad range of scenes and foreground contents. Compared with previous learning objectives, i.e., learning metric depth or relative depth, we propose to learn the affine-invariant depth using our diverse dataset to ensure both generalization and high-quality geometric shapes of scenes. Furthermore, in order to train the model on the complex dataset effectively, we propose a multi-curriculum learning method. Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting, demonstrating the excellent generalization capacity of the learned model to diverse scenes. The reconstructed point clouds with the predicted depth show that our method can recover high-quality 3D shapes. Code and dataset are available at: https://tinyurl.com/DiverseDepth
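
A hedged sketch of an affine-invariant depth loss in the spirit of the paper (not necessarily its exact formulation): each depth map is aligned by subtracting its median and dividing by its mean absolute deviation before an L1 comparison, so predictions are compared up to scale and shift.

    import torch

    def affine_invariant_depth_loss(pred, target, eps=1e-6):
        # pred, target: (B, H, W) depth maps; compare after per-image alignment.
        def align(d):
            d = d.flatten(1)
            med = d.median(dim=1, keepdim=True).values
            mad = (d - med).abs().mean(dim=1, keepdim=True)
            return (d - med) / (mad + eps)
        return (align(pred) - align(target)).abs().mean()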

2021-11-17

Survey on Out-of-Domain Detection in Deep Learning

Presenter: Wenxuan Li

Time: 7:00-8:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

In this survey, we focus on out-of-domain detection, in other words covariate shift detection. We start from the broader topic of out-of-distribution detection through a unified framework termed generalized out-of-distribution detection, since out-of-domain detection is a subtask of out-of-distribution detection. Under this framework, five problems (Anomaly Detection, Novelty Detection, Open Set Recognition, Out-of-Distribution Detection, and Outlier Detection) can be viewed as special cases or subtopics. We then clarify the background, definition, and applications of each subtopic. These five subtopics can be categorized by whether covariate shift, semantic shift, or both occur; since we are currently interested only in covariate shift detection, we select the subtopics involving covariate shift, summarize recent technical developments, and categorize the existing methods of each. At the end, a comprehensive paper list covering all five subtopics is given.

2021-11-10

Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos

Accepted by: CVPR-2021

Presenter: Shivanand Venkanna Sheshappanavar

Time: 8:00-9:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Point cloud videos exhibit irregularities and lack of order along the spatial dimension where points emerge inconsistently across different frames. To capture the dynamics in point cloud videos, point tracking is usually employed. However, as points may flow in and out across frames, computing accurate point trajectories is extremely difficult. Moreover, tracking usually relies on point colors and thus may fail to handle colorless point clouds. In this paper, to avoid point tracking, we propose a novel Point 4D Transformer (P4Transformer) network to model raw point cloud videos. Specifically, P4Transformer consists of (i) a point 4D convolution to embed the spatio-temporal local structures presented in a point cloud video and (ii) a transformer to capture the appearance and motion information across the entire video by performing self-attention on the embedded local features. In this fashion, related or similar local areas are merged with attention weight rather than by explicit tracking. Extensive experiments, including 3D action recognition and 4D semantic segmentation, on four benchmarks demonstrate the effectiveness of our P4Transformer for point cloud video modeling.

2021-10-27

ViViT: A Video Vision Transformer

Accepted by: ICCV-2021

Presenter: Kien Nguyen

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.
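
A small sketch of tubelet embedding, one plausible way the abstract's spatio-temporal tokens can be extracted: a 3D convolution with matching kernel and stride turns a clip into a sequence of tokens for the transformer encoder. Dimensions are illustrative.

    import torch
    import torch.nn as nn

    class TubeletEmbedding(nn.Module):
        # Non-overlapping 3D patches ("tubelets") projected to token embeddings.
        def __init__(self, in_channels=3, dim=768, tubelet=(2, 16, 16)):
            super().__init__()
            self.proj = nn.Conv3d(in_channels, dim, kernel_size=tubelet, stride=tubelet)

        def forward(self, video):                       # video: (B, C, T, H, W)
            tokens = self.proj(video)                   # (B, dim, T', H', W')
            return tokens.flatten(2).transpose(1, 2)    # (B, T'*H'*W', dim)

    # Example: a 32-frame 224x224 clip becomes 16*14*14 = 3136 tokens.
    clip = torch.randn(1, 3, 32, 224, 224)
    print(TubeletEmbedding()(clip).shape)   # torch.Size([1, 3136, 768])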

2021-10-20

Volumetric Breast Density Estimation on MRI Using Explainable Deep Learning Regression

Accepted by: Nature: Scientific Reports-2021

Presenter: Tang Li

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

The purpose of this paper was to assess the feasibility of volumetric breast density estimation on MRI without segmentations, accompanied by an explainability step. A total of 615 patients with breast cancer were included for volumetric breast density estimation. A 3-dimensional regression convolutional neural network (CNN) was used to estimate the volumetric breast density. Patients were split into training (N = 400), validation (N = 50), and hold-out test sets (N = 165). Hyperparameters were optimized using Neural Network Intelligence, and augmentations consisted of translations and rotations. The estimated densities were evaluated against the ground truth using Spearman’s correlation and Bland–Altman plots. The output of the CNN was visually analyzed using SHapley Additive exPlanations (SHAP). Spearman’s correlation between estimated and ground truth density was ρ = 0.81 (N = 165, P < 0.001) in the hold-out test set. The estimated density had a median bias of 0.70% (95% limits of agreement = − 6.8% to 5.0%) relative to the ground truth. SHAP showed that in correct density estimations, the algorithm based its decision on fibroglandular and fatty tissue. In incorrect estimations, other structures such as the pectoral muscle or the heart were included. To conclude, it is feasible to automatically estimate volumetric breast density on MRI without segmentations, and to provide accompanying explanations.

2021-10-13

Efficient Continual Learning with Modular Networks and Task-Driven Prior

Submitted to: ICLR-2021

Presenter: Fengchun Qiao

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past. There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. Finally, we introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use, and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. The Benchmark is publicly available at https://github.com/facebookresearch/CTrLBenchmark.

2021-10-06

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Submitted to: ArXiv-2021

Presenter: Meng Ma

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors which can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

2021-09-29

Time-series Generative Adversarial Networks

Accepted to: NeurIPS-2019

Presenter: Pranjal Dhakal

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

A good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between variables across time. Existing methods that bring generative adversarial networks (GANs) into the sequential setting do not adequately attend to the temporal correlations unique to time-series data. At the same time, supervised models for sequence prediction - which allow finer control over network dynamics - are inherently deterministic. We propose a novel framework for generating realistic time-series data that combines the flexibility of the unsupervised paradigm with the control afforded by supervised training. Through a learned embedding space jointly optimized with both supervised and adversarial objectives, we encourage the network to adhere to the dynamics of the training data during sampling. Empirically, we evaluate the ability of our method to generate realistic samples using a variety of real and synthetic time-series datasets. Qualitatively and quantitatively, we find that the proposed framework consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability.

2021-09-22

Broaden Your Views for Self-Supervised Video Learning

Accepted to: ICCV-2021

Presenter: Amani Arman Kiruga

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has a broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities into the broad view such as optical flow, randomly convolved RGB frames, audio or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.

2021-09-15

CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

Accepted to: ICML-2020

Presenter: Qitong Wang

Time: 8:00-9:00 p.m. EDT

Zoom: https://udel.zoom.us/j/95652916677

Slides: link

Mutual information (MI) minimization has gained considerable interest in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. We provide a theoretical analysis of the properties of CLUB and its variational approximation. Based on this upper bound, we introduce an MI minimization training scheme and further accelerate it with a negative sampling strategy. Simulation studies on Gaussian distributions show the reliable estimation ability of CLUB. Real-world MI minimization experiments, including domain adaptation and information bottleneck, demonstrate the effectiveness of the proposed method. The code is at https://github.com/Linear95/CLUB.
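
A hedged sketch of the sampled estimator implied by the abstract: with a variational network q(y|x), the upper bound is the gap between the conditional log-likelihood on paired samples and on randomly re-paired samples. Here log_q is a hypothetical callable returning per-sample values of log q(y|x).

    import torch

    def club_upper_bound(x, y, log_q):
        # E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)], with the second expectation
        # estimated by pairing each x with a randomly shuffled y.
        positive = log_q(x, y).mean()
        y_shuffled = y[torch.randperm(y.size(0))]
        negative = log_q(x, y_shuffled).mean()
        return positive - negative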

2021-05-12

Contrastive Multiview Coding

Accepted to: ArXiv'19

Presenter: Qitong

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video unsupervised learning benchmarks.

2021-05-05

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

Accepted to: 4th Deep Learning in Medical Image Analysis (DLMIA) Workshop

Presenter: Tang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in low-dose CT scans of the chest, nuclei segmentation in microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

2021-04-21

Few-Shot Adversarial Domain Adaptation

Accepted to: NeurIPS 2017

Presenter: Pranjal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

This work provides a framework for addressing the problem of supervised domain adaptation with deep models. The main idea is to exploit adversarial learning to learn an embedded subspace that simultaneously maximizes the confusion between two domains while semantically aligning their embedding. The supervised setting becomes attractive especially when there are only a few target data samples that need to be labeled. In this few-shot learning scenario, alignment and separation of semantic probability distributions is difficult because of the lack of data. We found that by carefully designing a training scheme whereby the typical binary adversarial discriminator is augmented to distinguish between four different classes, it is possible to effectively address the supervised adaptation problem. In addition, the approach has a high “speed” of adaptation, i.e. it requires an extremely low number of labeled target training samples, even one per category can be effective. We then extensively compare this approach to the state of the art in domain adaptation in two experiments: one using datasets for handwritten digit recognition, and one using datasets for visual object recognition.

2021-04-14

Memory-augmented Dense Predictive Coding for Video Representation Learning

Accepted to: ECCV 2020

Presenter: Ziyang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/95652916677

Slides: Coming soon...

The objective of this paper is self-supervised learning from video, in particular for representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) for the task. It is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing it to make multiple hypotheses efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance over other approaches while using orders of magnitude less training data.
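
The "convex combination of compressed memories" can be illustrated with a few lines of attention over a memory bank. This is a toy sketch with our own shapes and names, not the MemDPC implementation.

```python
import torch
import torch.nn.functional as F

memory = torch.randn(1024, 256)            # compressed memory bank (slots x dim)
query = torch.randn(8, 256)                # predicted query for the next time step

weights = F.softmax(query @ memory.t(), dim=-1)   # attention over memory slots
future = weights @ memory                         # convex combination of memories
print(future.shape, weights.sum(dim=-1))          # (8, 256); each row of weights sums to 1
```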

2021-04-07

VIBE: Video Inference for Human Body Pose and Shape Estimation

Accepted to: CVPR 2020

Presenter: Ruochen

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance.

2021-03-31

Perceiver: General Perception with Iterative Attention

Accepted to: arXiv 2021

Presenter: Meng

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
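
The asymmetric attention idea, a small learned latent array querying a very large input array, can be sketched roughly as below. This is an illustrative module with assumed sizes and names, not DeepMind's implementation; the cost scales with (latents x inputs) rather than (inputs squared).

```python
import torch
import torch.nn as nn

class CrossAttendBlock(nn.Module):
    def __init__(self, dim=256, num_latents=64, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, inputs):                     # inputs: (batch, n_inputs, dim)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        out, _ = self.attn(q, inputs, inputs)      # latents query the large byte array
        return out + self.ff(out)                  # (batch, num_latents, dim)

x = torch.randn(1, 10_000, 256)                    # e.g. 10k "pixels" as tokens
print(CrossAttendBlock()(x).shape)                 # torch.Size([1, 64, 256])
```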

2021-03-24

Sensor based Prediction of Human Driving Decisions using Feed forward Neural Networks for Intelligent Vehicles

Accepted to: International Conference on Intelligent Transportation Systems 2018

Presenter: Tanvir

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Prediction of human driving decisions is an important aspect of modeling human behavior for application to Advanced Driver Assistance Systems (ADAS) in intelligent vehicles. This paper presents a sensor-based receding horizon model for the prediction of human driving commands. Human driving decisions are expressed in terms of the vehicle speed and steering wheel angle profiles. Environmental state and human intention are the two major factors influencing human driving decisions. The environment around the vehicle is perceived using a LIDAR sensor. A feature extractor computes the occupancy grid map from the sensor data, which is filtered and processed to provide precise and relevant information to the feed-forward neural network. Human intentions can be identified from past driving decisions and represented in the form of time series data for the neural network. Supervised machine learning is used to train the neural network. Data collection and model validation are performed in the driving simulator using the SCANeR studio software. Simulation results are presented along with the analysis.

2021-03-17

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning

Accepted to: NeurIPS 2018

Presenter: Nate

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific 3D keypoints, along with their detectors. Given a single image, KeypointNet extracts 3D keypoints that are optimized for a downstream task. We demonstrate this framework on 3D pose estimation by proposing a differentiable objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object. Our model discovers geometrically and semantically consistent keypoints across viewing angles and instances of an object category. Importantly, we find that our end-to-end framework using no ground-truth keypoint annotations outperforms a fully supervised baseline using the same neural network architecture on the task of pose estimation. The discovered 3D keypoints on the car, chair, and plane categories of ShapeNet are visualized at keypointnet.github.io.
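
One ingredient that makes this kind of end-to-end keypoint discovery possible is a differentiable way to read keypoint coordinates off a heatmap. The snippet below sketches the standard expected-coordinate (soft-argmax) trick under our own naming; the paper's full objective is more involved.

```python
import torch

def expected_keypoints(heatmaps):
    """heatmaps: (batch, k, h, w) raw scores; returns (batch, k, 2) (x, y) in [-1, 1]."""
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w)
    x = (probs * xs).sum(dim=(2, 3))           # softmax-weighted mean column
    y = (probs * ys).sum(dim=(2, 3))           # softmax-weighted mean row
    return torch.stack([x, y], dim=-1)         # gradients flow back to the detector

print(expected_keypoints(torch.randn(2, 10, 32, 32)).shape)  # torch.Size([2, 10, 2])
```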

2021-03-10

Attention Is All You Need

Accepted to: NeurIPS 2017

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
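
At the heart of the Transformer is scaled dot-product attention. Below is a minimal toy version for reference; it is our own simplification, omitting masking, multiple heads, and the learned projections.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (..., n_q, d), k/v: (..., n_k, d); returns (..., n_q, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # similarity of queries to keys
    weights = F.softmax(scores, dim=-1)           # attention distribution per query
    return weights @ v                            # weighted sum of values

q = torch.randn(2, 5, 64)
k = v = torch.randn(2, 7, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```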

2020-12-16

OOPS! Predicting Unintentional Action in Video

Accepted to: CVPR 2020

Presenter: Tang Li

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

From just a short glance at a video, we can often tell whether a person’s action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly-supervised pre-training. However, a significant gap between machine and human performance remains.

2020-12-09

TLIO: Tight Learned Inertial Odometry

Accepted to: IEEE Robotics and Automation Letters 2020

Presenter: Nate

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

In this work we propose a tightly-coupled Extended Kalman Filter framework for IMU-only state estimation. Strap-down IMU measurements provide relative state estimates based on an IMU kinematic motion model. However, the integration of measurements is sensitive to sensor bias and noise, causing significant drift within seconds. Recent research by Yan et al. (RoNIN) and Chen et al. (IONet) showed the capability of using trained neural networks to obtain accurate 2D displacement estimates from segments of IMU data and obtained good position estimates by concatenating them. This paper demonstrates a network that regresses 3D displacement estimates and their uncertainty, giving us the ability to tightly fuse the relative state measurement into a stochastic cloning EKF to solve for pose, velocity and sensor biases. We show that our network, trained with pedestrian data from a headset, can produce statistically consistent measurements and uncertainties to be used as the update step in the filter, and that the tightly-coupled system outperforms velocity integration approaches in position estimates, and an AHRS attitude filter in orientation estimates.
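
The "tightly fuse the relative state measurement" step boils down to an EKF measurement update in which the network supplies both the 3D displacement and its covariance. The toy NumPy sketch below uses a simplified 6-D state and our own notation; the paper's stochastic-cloning filter is considerably richer.

```python
import numpy as np

def ekf_update(x, P, d_hat, R, H):
    """x: state mean, P: state covariance, d_hat: measured displacement,
    R: measurement covariance (from the network), H: measurement Jacobian."""
    y = d_hat - H @ x                       # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Toy example: 6-D state [position, velocity]; the measurement acts on position.
x = np.zeros(6); P = np.eye(6)
H = np.hstack([np.eye(3), np.zeros((3, 3))])
x, P = ekf_update(x, P, d_hat=np.array([0.1, 0.0, 0.02]), R=0.01 * np.eye(3), H=H)
print(x[:3])
```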

2020-11-18

Graph U-Nets

Accepted to: ICML 2019

Presenter: Pranjal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

We consider the problem of representation learning for graph data. Convolutional neural networks can naturally operate on images, but have significant challenges in dealing with graph data. Given that images are special cases of graphs with nodes lying on 2D lattices, graph embedding tasks have a natural correspondence with image pixel-wise prediction tasks such as segmentation. While encoder-decoder architectures like U-Nets have been successfully applied on many image pixel-wise prediction tasks, similar methods are lacking for graph data. This is due to the fact that pooling and up-sampling operations are not natural on graph data. To address these challenges, we propose novel graph pooling (gPool) and unpooling (gUnpool) operations in this work. The gPool layer adaptively selects some nodes to form a smaller graph based on their scalar projection values on a trainable projection vector. We further propose the gUnpool layer as the inverse operation of the gPool layer. The gUnpool layer restores the graph into its original structure using the position information of nodes selected in the corresponding gPool layer. Based on our proposed gPool and gUnpool layers, we develop an encoder-decoder model on graphs, known as the graph U-Nets. Our experimental results on node classification and graph classification tasks demonstrate that our methods achieve consistently better performance than previous models.
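
The gPool operation is compact enough to sketch directly: score nodes by their scalar projection onto a trainable vector, keep the top-k, and gate the kept features. The code below is our own condensed version; the exact gating nonlinearity in the paper may differ.

```python
import torch

def gpool(X, A, p, k):
    """X: (n, d) node features, A: (n, n) adjacency, p: (d,) trainable projection vector."""
    scores = (X @ p) / p.norm()                 # scalar projection per node
    idx = torch.topk(scores, k).indices         # nodes to keep
    X_pool = X[idx] * torch.sigmoid(scores[idx]).unsqueeze(-1)  # gated features
    A_pool = A[idx][:, idx]                     # induced subgraph
    return X_pool, A_pool, idx                  # idx lets gUnpool restore node positions

X = torch.randn(6, 4); A = torch.ones(6, 6); p = torch.randn(4)
Xp, Ap, idx = gpool(X, A, p, k=3)
print(Xp.shape, Ap.shape)                       # torch.Size([3, 4]) torch.Size([3, 3])
```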

2020-11-11

Occlusion Aware Unsupervised Learning of Optical Flow

Accepted to: CVPR 2018

Presenter: Ziyang Jia

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

It has been recently shown that a convolutional neural network can learn optical flow estimation with unsupervised learning. However, the performance of the unsupervised methods still has a relatively large gap compared to its supervised counterpart. Occlusion and large motion are some of the major factors that limit the current unsupervised learning of optical flow methods. In this work we introduce a new method which models occlusion explicitly and a new warping way that facilitates the learning of large motion. Our method shows promising results on the Flying Chairs, MPI-Sintel and KITTI benchmark datasets. Especially on the KITTI dataset, where abundant unlabeled samples exist, our unsupervised method outperforms its counterpart trained with supervised learning.

2020-11-04

Exploiting temporal information for 3D human pose estimation

Accepted to: ECCV 2018

Presenter: Ruochen Wang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict from images directly, the top-performing approaches have shown the effectiveness of dividing the task of 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from images and then mapping them into 3D space. They also showed that a low-dimensional representation like 2D locations of a set of joints can be discriminative enough to estimate 3D pose with high accuracy. However, estimation of 3D pose for individual frames leads to temporally incoherent estimates due to independent error in each frame causing jitter. Therefore, in this work we utilize the temporal information across a sequence of 2D joint locations to estimate a sequence of 3D poses. We designed a sequence-to-sequence network composed of layer-normalized LSTM units with shortcut connections connecting the input to the output on the decoder side and imposed a temporal smoothness constraint during training. We found that the knowledge of temporal consistency improves the best reported result on the Human3.6M dataset by approximately 12.2% and helps our network to recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.

2020-10-28

Multimodal Learning with Incomplete Modalities by Knowledge Distillation

Accepted to: KDD 2020

Presenter: Meng Ma

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Multimodal learning aims at utilizing information from a variety of data modalities to improve generalization performance. One common approach is to seek the common information that is shared among different modalities for learning, whereas we can also fuse the supplementary information to leverage modality-specific information. Though the supplementary information is often desired, most existing multimodal approaches can only learn from samples with complete modalities, which wastes a considerable amount of the data collected. Otherwise, model-based imputation needs to be used to complete the missing values, which may introduce undesired noise, especially when the sample size is limited. In this paper, we propose a framework based on knowledge distillation, utilizing the supplementary information from all modalities and avoiding the imputation and noise associated with it. Specifically, we first train models on each modality independently using all the available data. Then the trained models are used as teachers to teach the student model, which is trained with the samples having complete modalities. We demonstrate the effectiveness of the proposed method in extensive empirical studies on both synthetic datasets and real-world datasets.
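
The distillation step can be sketched as a weighted sum of a hard-label loss and soft-target losses from the unimodal teachers. The loss below is a generic knowledge-distillation sketch with assumed temperature and weighting, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)          # supervised term
    soft = sum(
        F.kl_div(F.log_softmax(student_logits / T, dim=1),  # student soft predictions
                 F.softmax(t / T, dim=1),                   # one teacher's soft targets
                 reduction="batchmean") * T * T
        for t in teacher_logits_list
    ) / len(teacher_logits_list)
    return alpha * hard + (1 - alpha) * soft

student = torch.randn(4, 10)
teachers = [torch.randn(4, 10), torch.randn(4, 10)]   # e.g. one teacher per modality
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teachers, labels).item())
```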

2020-10-21

Introduction to Self-Supervised Representation Learning

Accepted to: tutorial

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Given a task and enough labels, supervised learning can solve it really well. Good performance usually requires a decent amount of labels, but collecting manual labels is expensive (e.g., ImageNet) and hard to scale up. Considering that the amount of unlabelled data (e.g., free text, all the images on the Internet) is substantially larger than the limited number of human-curated labelled datasets, it seems rather wasteful not to use it. However, unsupervised learning is not easy and usually works much less efficiently than supervised learning. What if we could get labels for free for unlabelled data and train on it in a supervised manner? We can achieve this by framing a supervised learning task in a special form that predicts only a subset of the information using the rest. In this way, all the information needed, both inputs and labels, has been provided. This is known as self-supervised learning.
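
A concrete example of such "free labels" is rotation prediction: rotate each unlabeled image by 0/90/180/270 degrees and ask the network which rotation was applied. The snippet below is a toy illustration of this pretext-task pattern (one of several common ones; it is not taken from the slides).

```python
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """images: (n, c, h, w). Returns rotated copies and the rotation index as the label."""
    rotated, labels = [], []
    for k in range(4):                                   # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

images = torch.randn(8, 3, 32, 32)                       # unlabeled images
x, y = make_rotation_batch(images)                       # the labels came for free
head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))
loss = nn.CrossEntropyLoss()(head(x), y)                 # ordinary supervised loss
print(x.shape, loss.item())
```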

2020-10-14

SinGAN: Learning a Generative Model from a Single Natural Image

Accepted to: ICCV 2019 (Oral)

Presenter: Tang Li

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.

2020-10-07

CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM

Accepted to: CVPR 2018

Presenter: Nate

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

The representation of geometry in real-time 3D perception systems continues to be a critical research issue. Dense maps capture complete surface shape and can be augmented with semantic labels, but their high dimensionality makes them computationally costly to store and process, and unsuitable for rigorous probabilistic inference. Sparse feature-based representations avoid these problems, but capture only partial scene information and are mainly useful for localisation only. We present a new compact but dense representation of scene geometry which is conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters. We are inspired by work both on learned depth from images, and auto-encoders. Our approach is suitable for use in a keyframe-based monocular dense SLAM system: While each keyframe with a code can produce a depth map, the code can be optimised efficiently jointly with pose variables and together with the codes of overlapping keyframes to attain global consistency. Conditioning the depth map on the image allows the code to only represent aspects of the local geometry which cannot directly be predicted from the image. We explain how to learn our code representation, and demonstrate its advantageous properties in monocular SLAM.

2020-09-30

Unsupervised Domain Adaptation for 3D Human Pose Estimation

Accepted to: ACM MM 2019

Presenter: Hamed

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data which is inefficient and costly to collect. Previous methods have either resorted to weakly supervised methods to reduce the demand of ground-truth data for training, or using synthetically-generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require either additional supervision, such as unpaired 3D ground-truth data, or the camera parameters in multiview settings. On the other hand, the latter methods require accurately textured models, illumination configurations and background which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which aims at leveraging the knowledge in multi-modality data of the easy-to-get synthetic depth datasets to better train a pose estimator on the real-world datasets. Specifically, the framework first trains two pose estimators on synthetically-generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to unsupervisedly learn a new RGB image based 2D/3D human pose estimator. Here, the knowledge encoded in the supervised learning modules are used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term, as well as with different amount of supervisions signal.

2020-09-23

Introduction to Graph Convolution Networks

Accepted to: None

Presenter: Pranjal Dhakal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

This talk is a tutorial-style introduction to graph convolutional networks.

Code and notebooks are in this GitHub repo.

2020-09-16

Robust Learning Through Cross-Task Consistency

Accepted to: CVPR 2020 (Oral)

Presenter: Ziyang Jia

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Visual perception entails solving a wide set of tasks, e.g., object detection, depth estimation, etc. The predictions made for multiple tasks from the same image are not independent, and therefore, are expected to be ‘consistent’. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path invariance over a graph of arbitrary tasks. We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs. This framework also leads to an informative unsupervised quantity, called Consistency Energy, based on measuring the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), thus it can be employed as an unsupervised confidence metric as well as for detection of out-of-distribution inputs (ROC-AUC=0.95). The evaluations are performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape, and they benchmark cross-task consistency versus various baselines including conventional multi-task learning, cycle consistency, and analytical consistency.
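
A simplified way to picture Consistency Energy is as the disagreement between predictions of the same target reached through different inference paths. The toy code below uses our own simplified formula, not the paper's exact definition.

```python
import torch

def consistency_energy(predictions):
    """predictions: list of tensors, each (h, w), same target via different inference paths."""
    preds = torch.stack(predictions)
    mean = preds.mean(dim=0, keepdim=True)
    return (preds - mean).abs().mean()          # high energy = the paths disagree

direct = torch.rand(64, 64)                          # e.g. image -> normals directly
via_depth = direct + 0.05 * torch.randn(64, 64)      # e.g. image -> depth -> normals
print(consistency_energy([direct, via_depth]).item())
```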

2020-09-09

Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation

Accepted to: CVPR 2020 (Poster)

Presenter: Ruochen Wang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: We present a lightweight solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. Building upon recent advances in interpretable representation learning, we exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. This allows us to reason effectively about 3D pose across different views without using compute-intensive volumetric grids. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections, that can be simply lifted to 3D via a differentiable Direct Linear Transform (DLT) layer. In order to do it efficiently, we propose a novel implementation of DLT that is orders of magnitude faster on GPU architectures than standard SVD-based triangulation methods. We evaluate our approach on two large-scale human pose datasets (H36M and Total Capture): our method outperforms or performs comparably to the state-of-the-art volumetric methods, while, unlike them, yielding real-time performance.
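
For background, the classic SVD-based Direct Linear Transform that lifts per-view 2D detections to a 3D point looks as follows; the paper's contribution is a faster, GPU-friendly solver that avoids the SVD, which this sketch does not reproduce.

```python
import numpy as np

def dlt_triangulate(points_2d, projections):
    """points_2d: list of (u, v); projections: list of 3x4 camera matrices."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])      # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                            # null-space direction of A
    return X[:3] / X[3]                   # dehomogenize

# Two toy cameras looking at the 3D point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[0.5], [0.0], [0.0]])])
X = np.array([0.0, 0.0, 5.0, 1.0])
uv = [tuple((P @ X)[:2] / (P @ X)[2]) for P in (P1, P2)]
print(dlt_triangulate(uv, [P1, P2]))      # approximately [0, 0, 5]
```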

2020-08-26

Towards Visually Explaining Variational Autoencoders

Accepted to: CVPR 2020 (Oral)

Presenter: Yi Liu

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g. variational autoencoders (VAE) is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the Dsprites dataset.

2020-08-26

CPM-Nets: Cross Partial Multi-View Networks

Accepted to: NeurIPS 2019 (Poster)

Presenter: Meng Ma

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Although multi-view learning has progressed fast in past decades, it is still challenging due to the difficulty in modeling complex correlations among different views, especially under the context of view missing. To address the challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets). In this framework, we first give a formal definition of completeness and versatility for multi-view representation and then theoretically prove the versatility of the latent representation learned from our algorithm. To achieve completeness, the task of learning the latent multi-view representation is specifically translated to a degradation process by mimicking data transmission, such that the optimal tradeoff between consistency and complementarity across different views can be achieved. In contrast with methods that either complete missing views or group samples according to view-missing patterns, our model fully exploits all samples and all views to produce a structured representation for interpretability. Extensive experimental results validate the effectiveness of our algorithm over existing state-of-the-arts.

2020-08-19

Bilevel Continual Learning

Accepted to: ArXiv 2020

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Continual learning aims to learn continuously from a stream of tasks and data in an online-learning fashion, being capable of exploiting what was learned previously to improve current and future tasks while still being able to perform well on the previous tasks. One common limitation of many existing continual learning methods is that they often train a model directly on all available training data without validation due to the nature of continual learning, thus suffering poor generalization at test time. In this work, we present a novel framework of continual learning named "Bilevel Continual Learning" (BCL) by unifying a {\it bilevel optimization} objective and a {\it dual memory management} strategy comprising both episodic memory and generalization memory to achieve effective knowledge transfer to future tasks and alleviate catastrophic forgetting on old tasks simultaneously. Our extensive experiments on continual learning benchmarks demonstrate the efficacy of the proposed BCL compared to many state-of-the-art methods. Our implementation is available at https://github.com/phquang/bilevel-continual-learning.

2020-08-12

Vision-Based Fall Detection with Convolutional Neural Networks

Accepted to: Wireless Communications and Mobile Computing 2017

Presenter: Hamed Fayyaz

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: One of the biggest challenges in modern societies is the improvement of healthy aging and the support to older persons in their daily activities. In particular, given its social and economic impact, the automatic detection of falls has attracted considerable attention in the computer vision and pattern recognition communities. Although the approaches based on wearable sensors have provided high detection rates, some of the potential users are reluctant to wear them and thus their use is not yet normalized. As a consequence, alternative approaches such as vision-based methods have emerged. We firmly believe that the irruption of the Smart Environments and the Internet of Things paradigms, together with the increasing number of cameras in our daily environment, forms an optimal context for vision-based systems. Consequently, here we propose a vision-based solution using Convolutional Neural Networks to decide if a sequence of frames contains a person falling. To model the video motion and make the system scenario independent, we use optical flow images as input to the networks followed by a novel three-step training phase. Furthermore, our method is evaluated in three public datasets achieving the state-of-the-art results in all three of them.

2020-08-05

Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction

Accepted to: CVPR 2020 (Poster)

Presenter: Pranjal Dhakal

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Better machine understanding of pedestrian behaviors enables faster progress in modeling interactions between agents such as autonomous vehicles and humans. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects. Previous methods modeled these interactions by using a variety of aggregation methods that integrate different learned pedestrian states. We propose the Social Spatio-Temporal Graph Convolutional Neural Network (Social-STGCNN), which substitutes the need for aggregation methods by modeling the interactions as a graph. Our results show an improvement over the state of the art by 20% on the Final Displacement Error (FDE) and an improvement on the Average Displacement Error (ADE) with 8.5 times fewer parameters and up to 48 times faster inference speed than previously reported methods. In addition, our model is data efficient, and exceeds the previous state of the art on the ADE metric with only 20% of the training data. We propose a kernel function to embed the social interactions between pedestrians within the adjacency matrix. Through qualitative analysis, we show that our model inherits social behaviors that can be expected between pedestrian trajectories. Code is available at https://github.com/abduallahmohamed/Social-STGCNN.
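
The kernel-weighted adjacency can be illustrated with an inverse-distance kernel: nearby pedestrians get stronger edges, and the matrix is then symmetrically normalized for graph convolution. This is a rough sketch with our own normalization choice, not the released code.

```python
import numpy as np

def inverse_distance_adjacency(positions, eps=1e-6):
    """positions: (n, 2) pedestrian locations at one time step."""
    n = len(positions)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i, j] = 1.0 / (np.linalg.norm(positions[i] - positions[j]) + eps)
    # Symmetric normalization, as commonly used with graph convolutions.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1) + eps))
    return D_inv_sqrt @ A @ D_inv_sqrt

positions = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(inverse_distance_adjacency(positions).round(3))
```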

2020-07-29

4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras

Accepted to: CVPR 2020 (Oral)

Presenter: Ruochen Wang

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: This paper contributes a novel realtime multi-person motion capture algorithm using multiview video inputs. Due to the heavy occlusions in each view, joint optimization on the multiview images and multiple temporal frames is indispensable, which brings up the essential challenge of realtime efficiency. To this end, for the first time, we unify per-view parsing, cross-view matching, and temporal tracking into a single optimization framework, i.e., a 4D association graph in which each dimension (image space, viewpoint and time) can be treated equally and simultaneously. To solve the 4D association graph efficiently, we further contribute the idea of 4D limb bundle parsing based on heuristic searching, followed by limb bundle assembling via a proposed bundle Kruskal's algorithm. Our method enables a realtime online motion capture system running at 30 fps using 5 cameras on a 5-person scene. Benefiting from the unified parsing, matching and tracking constraints, our method is robust to noisy detection, and achieves high-quality online pose reconstruction. The proposed method outperforms the state-of-the-art method quantitatively without using high-level appearance information. We also contribute a multiview video dataset synchronized with a marker-based motion capture system for scientific evaluation.

2020-07-22

Sampling-free Epistemic Uncertainty Estimation Using Approximated Variance Propagation

Accepted to: ICCV 2019 (Oral)

Presenter: Ziyang Jia

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: We present a sampling-free approach for computing the epistemic uncertainty of a neural network. Epistemic uncertainty is an important quantity for the deployment of deep neural networks in safety-critical applications, since it represents how much one can trust predictions on new data. Recently promising works were proposed using noise injection combined with Monte-Carlo sampling at inference time to estimate this quantity (e.g. Monte-Carlo dropout). Our main contribution is an approximation of the epistemic uncertainty estimated by these methods that does not require sampling, thus notably reducing the computational overhead. We apply our approach to large-scale visual tasks (i.e., semantic segmentation and depth regression) to demonstrate the advantages of our method compared to sampling-based approaches in terms of quality of the uncertainty estimates as well as of computational overhead.
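
The sampling-free idea rests on propagating moments analytically. For a single linear layer with independent input noise, the output variance has a closed form, as the toy sketch below shows (the paper extends this through full networks and non-linearities; the Monte-Carlo check is only for illustration).

```python
import numpy as np

def propagate_linear(mean, var, W, b):
    """mean, var: (d_in,) input mean/variance; W: (d_out, d_in); b: (d_out,)."""
    out_mean = W @ mean + b
    out_var = (W ** 2) @ var        # Var(sum_i w_i x_i) = sum_i w_i^2 Var(x_i)
    return out_mean, out_var

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 8)), np.zeros(4)
mean, var = rng.normal(size=8), 0.1 * np.ones(8)       # e.g. noise injected by dropout
analytic_mean, analytic_var = propagate_linear(mean, var, W, b)

# Monte-Carlo check: sample noisy inputs, push them through the layer, compare variances.
samples = mean + np.sqrt(var) * rng.normal(size=(100_000, 8))
mc_var = (samples @ W.T + b).var(axis=0)
print(np.round(analytic_var, 3), np.round(mc_var, 3))   # should be close
```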

2020-07-15

Weight Agnostic Neural Networks

Accepted to: NeurIPS 2019 (Poster)

Presenter: Zhenzhu Zheng

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: Not all neural network architectures are created equal, some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much higher than chance accuracy on MNIST using random weights. Interactive version of this paper at https://weightagnostic.github.io
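
The evaluation protocol, scoring one fixed architecture over a sweep of a single shared weight, can be mimicked on a toy task as below. The network topology and task here are our own stand-ins, not the authors' search space.

```python
import numpy as np

def forward(x, shared_w):
    """A tiny fixed topology in which every connection uses the same weight."""
    h = np.tanh(shared_w * x)              # layer 1
    return np.tanh(shared_w * h.sum())     # layer 2, scalar output

def score(shared_w):
    # Toy task: the output should have the same sign as the sum of the inputs.
    rng = np.random.default_rng(0)
    xs = rng.normal(size=(200, 5))
    preds = np.array([forward(x, shared_w) for x in xs])
    return np.mean(np.sign(preds) == np.sign(xs.sum(axis=1)))

for w in (-2.0, -1.0, 1.0, 2.0):           # sample shared weights, report mean performance
    print(w, score(w))
```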

2020-07-08

Deep InfoMax: Learning deep representations by mutual information estimation and maximization

Accepted to: ICLR 2019 (Oral)

Presenter: Yi Liu

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: This work investigates unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks with some standard architectures. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation learning objectives for specific end-goals.

2020-07-01

Tracking by Instance Detection: A Meta-Learning Approach

Accepted to: CVPR 2020 (Oral)

Presenter: Meng Ma

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: We consider the tracking problem as a special type of object detection problem, which we call instance detection. With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image. We find that model-agnostic meta-learning (MAML) offers a strategy to initialize the detector that satisfies our needs. We propose a principled three-step approach to build a high-performance tracker. First, pick any modern object detector trained with gradient descent. Second, conduct offline training (or initialization) with MAML. Third, perform domain adaptation using the initial frame. We follow this procedure to build two trackers, named Retina-MAML and FCOS-MAML, based on two modern detectors RetinaNet and FCOS. Evaluations on four benchmarks show that both trackers are competitive against state-of-the-art trackers. On OTB-100, Retina-MAML achieves the highest ever AUC of 0.712. On TrackingNet, FCOS-MAML ranks the first on the leader board with an AUC of 0.757 and the normalized precision of 0.822. Both trackers run in real-time at 40 FPS.
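
The MAML-style initialization referred to above can be sketched on a toy 1-D regression: an inner gradient step adapts to one task, and the outer loss trains the initialization so that this single step works well. The code is a generic MAML sketch (our own task, hyperparameters, and support/query split), not the trackers' training code.

```python
import torch

w = torch.zeros(1, requires_grad=True)               # shared initialization
outer_opt = torch.optim.SGD([w], lr=1e-2)

for step in range(200):
    outer_opt.zero_grad()
    for slope in (1.0, 2.0, 3.0):                     # each "instance"/task: y = slope * x
        x_s, x_q = torch.randn(16, 1), torch.randn(16, 1)
        y_s, y_q = slope * x_s, slope * x_q
        inner_loss = ((x_s * w - y_s) ** 2).mean()    # adapt on the support set
        (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_adapted = w - 0.1 * g                       # one inner gradient step
        outer_loss = ((x_q * w_adapted - y_q) ** 2).mean()
        outer_loss.backward()                         # gradient flows through the inner step
    outer_opt.step()

print(w.item())                                       # an initialization that adapts quickly
```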

2020-06-24

Open Compound Domain Adaptation

Accepted to: CVPR 2020 (Oral)

Presenter: Fengchun Qiao

Time: 9:00-10:00 p.m. EST

Zoom: https://udel.zoom.us/j/96483565680

Slides: link

Abstract: A typical domain adaptation approach is to adapt models trained on the annotated data in a source domain (e.g., sunny weather) for achieving high performance on the test data in a target domain (e.g., rainy weather). Whether the target contains a single homogeneous domain or multiple heterogeneous domains, existing works always assume that there exist clear distinctions between the domains, which is often not true in practice (e.g., changes in weather). We study an open compound domain adaptation (OCDA) problem, where the target is a compound of multiple homogeneous domains without domain labels, reflecting realistic data collection from mixed and novel situations. We propose a new approach based on two technical insights into OCDA: 1) a curriculum domain adaptation strategy to bootstrap generalization across domains in a data-driven self-organizing fashion and 2) a memory module to increase the model’s agility towards novel domains. Our experiments on digit classification, facial expression recognition, semantic segmentation, and reinforcement learning demonstrate the effectiveness of our approach.

About Reading Group

The reading group is held weekly by D-REAL at University of Delaware. The goal is to broaden the scope of research interest in Machine Learning, Deep Learning, and Computer Vision by sharing and discussing high-quality papers.

Schedule at Winter & Spring 2024

  • Kien Nguyen: Jan. 23rd
  • Tang Li: Feb. 8th
  • Fengchun Qiao: Feb. 27th
  • Ricardo Santos: Mar. 12th
  • Tang Li: Mar. 26th
  • Jeffrey Peng: Apr. 2nd
  • Qitong Wang: Apr. 9th
  • Fengchun Qiao: Apr. 23rd
  • Meng Ma: Apr. 30th

Contact

  • Person: Qitong Wang
  • Email: wqtwjt at udel dot edu