{"title":"End-to-End Training of Deep Visuomotor Policies","authors":"S. Levine, Chelsea Finn, Trevor Darrell, P. Abbeel","doi":"10.5555/2946645.2946684","DOIUrl":"https://doi.org/10.5555/2946645.2946684","url":null,"abstract":"Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"4 1 1","pages":"39:1-39:40"},"PeriodicalIF":0.0,"publicationDate":"2015-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90811761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Libra toolkit for probabilistic models","authors":"Daniel Lowd, Pedram Rooshenas","doi":"10.5555/2789272.2912077","DOIUrl":"https://doi.org/10.5555/2789272.2912077","url":null,"abstract":"The Libra Toolkit is a collection of algorithms for learning and inference with discrete probabilistic models, including Bayesian networks, Markov networks, dependency networks, and sum-product networks. Compared to other toolkits, Libra places a greater emphasis on learning the structure of tractable models in which exact inference is efficient. It also includes a variety of algorithms for learning graphical models in which inference is potentially intractable, and for performing exact and approximate inference. Libra is released under a 2-clause BSD license to encourage broad use in academia and industry.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"26 1","pages":"2459-2463"},"PeriodicalIF":0.0,"publicationDate":"2015-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81443275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm","authors":"Pascal Germain, A. Lacasse, François Laviolette, M. Marchand, Jean-Francis Roy","doi":"10.5555/2789272.2831140","DOIUrl":"https://doi.org/10.5555/2789272.2831140","url":null,"abstract":"We propose an extensive analysis of the behavior of majority votes in binary classification. In particular, we introduce a risk bound for majority votes, called the C-bound, that takes into account the average quality of the voters and their average disagreement. We also propose an extensive PAC-Bayesian analysis that shows how the C-bound can be estimated from various observations contained in the training data. The analysis intends to be self-contained and can be used as introductory material to PAC-Bayesian statistical learning theory. It starts from a general PAC-Bayesian perspective and ends with uncommon PAC-Bayesian bounds. Some of these bounds contain no Kullback-Leibler divergence and others allow kernel functions to be used as voters (via the sample compression setting). Finally, out of the analysis, we propose the MinCq learning algorithm that basically minimizes the C-bound. MinCq reduces to a simple quadratic program. Aside from being theoretically grounded, MinCq achieves state-of-the-art performance, as shown in our extensive empirical comparison with both AdaBoost and the Support Vector Machine.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"81 1","pages":"787-860"},"PeriodicalIF":0.0,"publicationDate":"2015-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85481769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Existence and uniqueness of proper scoring rules","authors":"E. Ovcharov","doi":"10.5555/2789272.2886820","DOIUrl":"https://doi.org/10.5555/2789272.2886820","url":null,"abstract":"To discuss the existence and uniqueness of proper scoring rules one needs to extend the associated entropy functions as sublinear functions to the conic hull of the prediction set. In some natural function spaces, such as the Lebesgue Lp-spaces over Rd, the positive cones have empty interior. Entropy functions defined on such cones have directional derivatives only, which typically exist on large subspaces and behave similarly to gradients. Certain entropies may be further extended continuously to open cones in normed spaces containing signed densities. The extended densities are Gâteaux differentiable except on a negligible set and have everywhere continuous subgradients due to the supporting hyperplane theorem. We introduce the necessary framework from analysis and algebra that allows us to give an affirmative answer to the titular question of the paper. As a result of this, we give a formal sense in which entropy functions have uniquely associated proper scoring rules. We illustrate our framework by studying the derivatives and subgradients of the following three prototypical entropies: Shannon entropy, Hyvarinen entropy, and quadratic entropy.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"12 1","pages":"2207-2230"},"PeriodicalIF":0.0,"publicationDate":"2015-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76915820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Photonic delay systems as machine learning implementations","authors":"Michiel Hermans, M. C. Soriano, J. Dambre, P. Bienstman, Ingo Fischer","doi":"10.5555/2789272.2886817","DOIUrl":"https://doi.org/10.5555/2789272.2886817","url":null,"abstract":"Nonlinear photonic delay systems present interesting implementation platforms for machine learning models. They can be extremely fast, offer great degrees of parallelism and potentially consume far less power than digital processors. So far they have been successfully employed for signal processing using the Reservoir Computing paradigm. In this paper we show that their range of applicability can be greatly extended if we use gradient descent with backpropagation through time on a model of the system to optimize the input encoding of such systems. We perform physical experiments that demonstrate that the obtained input encodings work well in reality, and we show that optimized systems perform significantly better than the common Reservoir Computing approach. The results presented here demonstrate that common gradient descent techniques from machine learning may well be applicable on physical neuro-inspired analog computers.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"112 1","pages":"2081-2097"},"PeriodicalIF":0.0,"publicationDate":"2015-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76787368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discrete reproducing kernel Hilbert spaces: sampling and distribution of Dirac-masses","authors":"P. Jorgensen, Feng Tian","doi":"10.5555/2789272.2912098","DOIUrl":"https://doi.org/10.5555/2789272.2912098","url":null,"abstract":"We study reproducing kernels, and associated reproducing kernel Hilbert spaces (RKHSs) $mathscr{H}$ over infinite, discrete and countable sets $V$. In this setting we analyze in detail the distributions of the corresponding Dirac point-masses of $V$. Illustrations include certain models from neural networks: An Extreme Learning Machine (ELM) is a neural network-configuration in which a hidden layer of weights are randomly sampled, and where the object is then to compute resulting output. For RKHSs $mathscr{H}$ of functions defined on a prescribed countable infinite discrete set $V$, we characterize those which contain the Dirac masses $delta_{x}$ for all points $x$ in $V$. Further examples and applications where this question plays an important role are: (i) discrete Brownian motion-Hilbert spaces, i.e., discrete versions of the Cameron-Martin Hilbert space; (ii) energy-Hilbert spaces corresponding to graph-Laplacians where the set $V$ of vertices is then equipped with a resistance metric; and finally (iii) the study of Gaussian free fields.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"31 1","pages":"3079-3114"},"PeriodicalIF":0.0,"publicationDate":"2015-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77194683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework","authors":"Michael R. Thon, H. Jaeger","doi":"10.5555/2789272.2789276","DOIUrl":"https://doi.org/10.5555/2789272.2789276","url":null,"abstract":"Stochastic multiplicity automata (SMA) are weighted finite automata that generalize probabilistic automata. They have been used in the context of probabilistic grammatical inference. Observable operator models (OOMs) are a generalization of hidden Markov models, which in turn are models for discrete-valued stochastic processes and are used ubiquitously in the context of speech recognition and bio-sequence modeling. Predictive state representations (PSRs) extend OOMs to stochastic input-output systems and are employed in the context of agent modeling and planning. We present SMA, OOMs, and PSRs under the common framework of sequential systems, which are an algebraic characterization of multiplicity automata, and examine the precise relationships between them. Furthermore, we establish a unified approach to learning such models from data. Many of the learning algorithms that have been proposed can be understood as variations of this basic learning scheme, and several turn out to be closely related to each other, or even equivalent.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"13 1","pages":"103-147"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75702845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preface to this special issue","authors":"A. Gammerman, V. Vovk","doi":"10.5555/2789272.2886803","DOIUrl":"https://doi.org/10.5555/2789272.2886803","url":null,"abstract":"This issue of JMLR is devoted to the memory of Alexey Chervonenkis. Over the period of a dozen years between 1962 and 1973 he and Vladimir Vapnik created a new discipline of statistical learning theory—the foundation on which all our modern understanding of pattern recognition is based. Alexey was 28 years old when they made their most famous and original discovery, the uniform law of large numbers. In that short period Vapnik and Chervonenkis also introduced the main concepts of statistical learning theory, such as VCdimension, capacity control, and the Structural Risk Minimization principle, and designed two powerful pattern recognition methods, Generalised Portrait and Optimal Separating Hyperplane, later transformed by Vladimir Vapnik into Support Vector Machine—arguably one of the best tools for pattern recognition and regression estimation. Thereafter Alexey continued to publish original and important contributions to learning theory. He was also active in research in several applied fields, including geology, bioinformatics, medicine, and advertising. Alexey tragically died in September 2014 after getting lost during a hike in the Elk Island park on the outskirts of Moscow. Vladimir Vapnik suggested to prepare an issue of JMLR to be published at the first anniversary of the death of his long-term collaborator and close friend. Vladimir and the editors contacted a few dozen leading researchers in the fields of machine learning related to Alexey’s research interests and had many enthusiastic replies. In the end eleven papers were accepted. This issue also contains a first attempt at a complete bibliography of Alexey Chervonenkis’s publications. Simultaneously with this special issue will appear Alexey’s Festschrift (Vovk et al., 2015), to which the reader is referred for information about Alexey’s research, life, and death. The Festschrift is based in part on a symposium held in Pathos, Cyprus, in 2013 to celebrate Alexey’s 75th anniversary. Apart from research contributions, it contains Alexey’s reminiscences about his early work on statistical learning with Vladimir Vapnik, a reprint of their seminal 1971 paper, a historical chapter by R. M. Dudley, reminiscences of Alexey’s and Vladimir’s close colleague Vasily Novoseltsev, and three reviews of various measures of complexity used in machine learning (“Measures of Complexity” is both the name of the symposium and the title of the book). Among Alexey’s contributions to machine learning (mostly joint with Vladimir Vapnik) discussed in the book are:","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"30 1","pages":"1677-1681"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73946949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised interpolation in an anticausal learning scenario","authors":"D. Janzing, B. Scholkopf","doi":"10.5555/2789272.2886811","DOIUrl":"https://doi.org/10.5555/2789272.2886811","url":null,"abstract":"According to a recently stated 'independence postulate', the distribution Pcause contains no information about the conditional Peffect|cause while Peffect may contain information about Pcause|effect. Since semi-supervised learning (SSL) attempts to exploit information from PX to assist in predicting Y from X, it should only work in anticausal direction, i.e., when Y is the cause and X is the effect. In causal direction, when X is the cause and Y the effect, unlabelled x-values should be useless. To shed light on this asymmetry, we study a deterministic causal relation Y = f(X) as recently assayed in Information-Geometric Causal Inference (IGCI). Within this model, we discuss two options to formalize the independence of PX and f as an orthogonality of vectors in appropriate inner product spaces. We prove that unlabelled data help for the problem of interpolating a monotonically increasing function if and only if the orthogonality conditions are violated - which we only expect for the anticausal direction. Here, performance of SSL and its supervised baseline analogue is measured in terms of two different loss functions: first, the mean squared error and second the surprise in a Bayesian prediction scenario.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"9 1","pages":"1923-1948"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84605148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A view of margin losses as regularizers of probability estimates","authors":"Hamed Masnadi-Shirazi, N. Vasconcelos","doi":"10.5555/2789272.2912087","DOIUrl":"https://doi.org/10.5555/2789272.2912087","url":null,"abstract":"Regularization is commonly used in classifier design, to assure good generalization. Classical regularization enforces a cost on classifier complexity, by constraining parameters. This is usually combined with a margin loss, which favors large-margin decision rules. A novel and unified view of this architecture is proposed, by showing that margin losses act as regularizers of posterior class probabilities, in a way that amplifies classical parameter regularization. The problem of controlling the regularization strength of a margin loss is considered, using a decomposition of the loss in terms of a link and a binding function. The link function is shown to be responsible for the regularization strength of the loss, while the binding function determines its outlier robustness. A large class of losses is then categorized into equivalence classes of identical regularization strength or outlier robustness. It is shown that losses in the same regularization class can be parameterized so as to have tunable regularization strength. This parameterization is finally used to derive boosting algorithms with loss regularization (BoostLR). Three classes of tunable regularization losses are considered in detail. Canonical losses can implement all regularization behaviors but have no flexibility in terms of outlier modeling. Shrinkage losses support equally parameterized link and binding functions, leading to boosting algorithms that implement the popular shrinkage procedure. This offers a new explanation for shrinkage as a special case of loss-based regularization. Finally, α-tunable losses enable the independent parameterization of link and binding functions, leading to boosting algorithms of great exibility. This is illustrated by the derivation of an algorithm that generalizes both AdaBoost and LogitBoost, behaving as either one when that best suits the data to classify. Various experiments provide evidence of the benefits of probability regularization for both classification and estimation of posterior class probabilities.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"17 1","pages":"2751-2795"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80322330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}