Before training a neural network, decisions must be made about the size of the network. These choices are often non-obvious and hard to make well in advance. Instead of choosing ahead of time, auto-sizing uses group regularizers to prune neurons during training. Our toolkit (PyPI, GitHub) integrates with any PyTorch model with only 3 lines of code.
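The core idea behind pruning neurons with a group regularizer can be sketched with the proximal step of a row-wise group (ℓ2,1) penalty: each row of a weight matrix holds one neuron's weights, and any row whose norm falls below the regularization strength is zeroed out, removing that neuron. This is an illustrative NumPy sketch, not the toolkit's actual API; the function name and toy weights are my own.

```python
import numpy as np

def group_lasso_prox(W, lam):
    """Proximal step of the row-wise group (l2,1) regularizer.

    Each row of W holds one neuron's incoming weights. Rows are shrunk
    toward zero; any row whose l2 norm falls below lam is zeroed
    entirely, pruning that neuron from the network.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * scale

# A weak neuron (row 0, norm ~0.14) is pruned outright;
# a strong one (row 1, norm 5) is only shrunk.
W = np.array([[0.1, 0.1],
              [3.0, 4.0]])
print(group_lasso_prox(W, 1.0))
```

Applying this step between gradient updates lets the optimizer itself decide how many neurons survive, rather than fixing the layer sizes by hand.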
Neural Machine Translation systems favor shorter translations, often even when they are incorrect. I am interested in the theoretical and experimental reasons why.
Neural networks often require large amounts of training data to perform well. I am also interested in leveraging out-of-domain data to improve performance in low-resource settings.
Anomalous pattern detection is a popular subfield of computer science aimed at detecting anomalous items, and groupings of items, in a dataset using methods from machine learning, data mining, and statistics. For anomaly detection tasks over geospatially and temporally labeled data, spatial scan statistics have been successfully applied to numerous spatiotemporal data mining and pattern detection problems, such as predicting crime waves or outbreaks of diseases [12, 7, 14, 15]. However, spatial scan statistics are limited in that they can only scan over a structured set of data streams. When spatiotemporal datasets contain unstructured free text, spatial scan statistics require preprocessing the data into structured categories. Manually labeling and annotating text can be time-consuming or infeasible, while automatic classification methods that assign text fields to a predefined set of event types can obscure the occurrence of novel events (such as a disease outbreak with a previously unseen pattern of symptoms), potentially drowning out the signal of the exact outliers the method is attempting to detect.
In this thesis, we propose the Semantic Scan Statistic, which integrates spatial scanning with unsupervised topic modeling to enable timely and accurate detection of novel disease outbreaks. We discuss some of the inherent challenges of working with free-text data in an anomalous pattern detection framework, and we present novel approaches to the problem that specifically adapt topic modeling algorithms to enable anomaly detection. We evaluate our approach using two years of free-text Emergency Department chief complaint data from Allegheny County, PA, demonstrating the efficacy of the Semantic Scan Statistic and the benefits of incorporating unstructured text for spatial event detection. Using semi-synthetic disease outbreaks, a common evaluation method in the disease surveillance field, we show that our approach detects disease outbreaks over 25% faster than current state-of-the-art methods that do not use textual information.
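At the heart of spatial scanning is a likelihood-ratio score that compares the observed count in a candidate region against its expected baseline. A minimal sketch, assuming the expectation-based Poisson statistic (one common formulation; the function names and toy data are my own, not the thesis's code):

```python
import math
from itertools import combinations

def ebp_score(count, baseline):
    # Expectation-based Poisson log-likelihood ratio: large when the
    # observed count C in a region exceeds its expected baseline B.
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count

def scan(counts, baselines, max_size=2):
    # Exhaustively score every candidate region (a subset of locations
    # up to max_size) and return the highest-scoring one.
    locs = range(len(counts))
    regions = [r for k in range(1, max_size + 1)
               for r in combinations(locs, k)]
    return max(regions, key=lambda r: ebp_score(sum(counts[i] for i in r),
                                                sum(baselines[i] for i in r)))

# Location 0 sees five times its expected count, so it forms the cluster.
print(scan(counts=[10, 2, 2], baselines=[2, 2, 2]))
```

Real scan statistics restrict candidate regions to spatially coherent windows and assess significance by randomization testing; the exhaustive subset scan above is only meant to show the scoring.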
Machine Translation, the subfield of Computer Science that focuses on translating between two human languages, has greatly benefited from neural networks. However, these neural machine translation systems have complicated architectures with many hyperparameters that must be chosen manually. Frequently, these are selected either through a grid search over values or by using values commonplace in the literature. However, these choices are not theoretically justified, and the same values are not optimal for all language pairs and datasets.
Fortunately, the innate structure of the problem allows these hyperparameters to be optimized during training. Traditionally, the hyperparameters of a system are chosen first, and then a learning algorithm optimizes all of the parameters within the model. In this work, I propose three methods to learn the optimal hyperparameters during the training of the model, allowing for one step instead of two. First, I propose using group regularizers to learn the number and size of the hidden neural network layers. Second, I demonstrate how to use a perceptron-like tuning method to address the known problems of undertranslation and label bias. Finally, I propose an Expectation-Maximization-based method to learn the optimal vocabulary size and granularity. Using various techniques from machine learning and numerical optimization, this dissertation covers how to learn the hyperparameters of a Neural Machine Translation system while training the model itself.
Latent Dirichlet allocation, or LDA, is a successful generative probabilistic model of text corpora that has performed well on many tasks across Natural Language Processing. Despite being well suited to Automatic Summarization, it has never been applied there. In this paper, I introduce Summarization by LDA, or SLDA, which better models the subtopics of a document, leading to more pertinent, relevant, and concise summaries than other summarization methods. This new approach is competitive with the leading methods in the field and even outperforms them in many respects. In addition to SLDA, I introduce a novel evaluation technique for summarization that does not rely on gold standards. It overcomes many of the challenges posed by inherent disagreement among people about what makes a good summary by evaluating over large numbers of people using the commercial service Mechanical Turk. Overall, this paper lays the groundwork for transforming the conventions of the Automatic Summarization field by challenging many of its definitions.
kenton AT jhu DOT edu (Preferred)