The schedule clusters one-credit courses into four-week segments to help you focus on specific topics over the 10 months of the program. This is a full-time program. Courses are lab-oriented and delivered in person.

The program includes an eight- to ten-week capstone project, allowing you to work with other students on real-life data sets and to apply the techniques you have learned to larger data sets and more complex problems.

Course Descriptions

Programming for Data Science

Overview of data structures, iteration, flow control, and program design relevant to data exploration and analysis. When and how to exploit pre-existing libraries.

2017/2018 Instructors: Patrick Walls and Vincenzo Coia.

Computing Platforms for Data Science

How to install, maintain, and use the data scientific software “stack”. The Unix operating system, integrated development environments, and problem-solving strategies.

2017/2018 Instructor: Giulio Dalla Riva.

Communication and Argumentation

Effective oral and written communication, across diverse target audiences, to facilitate understanding and decision-making. How to present and interpret data, with productive skepticism and an awareness of assumptions and bias.

2017/2018 Instructor: TBD.

Descriptive Statistics and Probability for Data Science

Fundamental concepts in probability. Describing data generated from a probability distribution. Statistical view of data coming from a probability distribution.

2017/2018 Instructor: Shaun Sun.

Data Wrangling

Converting data from the form in which it is collected to the form needed for analysis. How to clean, filter, arrange, aggregate, and transform diverse data types, e.g. strings, numbers, and date-times.

2017/2018 Instructor: Jenny Bryan.
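
As a rough illustration of the clean, filter, arrange, and aggregate steps named in the Data Wrangling description, here is a minimal sketch using only the Python standard library. The records, field names, and the `clean` helper are hypothetical and are not taken from the course materials.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records, as they might arrive from a form or log export.
raw_records = [
    {"city": " vancouver ", "temp_c": "12.5", "date": "2017-09-05"},
    {"city": "Vancouver", "temp_c": "n/a", "date": "2017-09-06"},
    {"city": "KELOWNA", "temp_c": "18.0", "date": "2017-09-05"},
    {"city": "kelowna ", "temp_c": "21.0", "date": "2017-09-07"},
]


def clean(record):
    """Normalize one record, or return None if the temperature is unusable."""
    try:
        temp = float(record["temp_c"])  # string -> number
    except ValueError:
        return None
    return {
        "city": record["city"].strip().title(),  # tidy the string
        "temp_c": temp,
        "date": datetime.strptime(record["date"], "%Y-%m-%d").date(),  # string -> date
    }


# Clean and filter: drop records whose temperature could not be parsed.
cleaned = [r for r in map(clean, raw_records) if r is not None]

# Arrange: sort by date, then by city.
cleaned.sort(key=lambda r: (r["date"], r["city"]))

# Aggregate: mean temperature per city.
temps_by_city = defaultdict(list)
for r in cleaned:
    temps_by_city[r["city"]].append(r["temp_c"])
mean_temp = {city: sum(v) / len(v) for city, v in temps_by_city.items()}
```

In practice this kind of work is usually done with a data-frame library; the standard library is used here only to keep the sketch self-contained.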

Data Visualization I

The design and implementation of static figures across all phases of data analysis, from ingest and cleaning to description and inference. Plotting tools in R and Python.

2017/2018 Instructor: Vincenzo Coia.

Algorithms and Data Structures

How to choose and use appropriate algorithms and data structures to help solve data science problems. Key concepts such as recursion and algorithmic complexity (e.g., efficiency, scalability).

2017/2018 Instructor: Patrice Belleville.

Statistical Inference and Computation I

The statistical and probabilistic foundations of inference, developed jointly through mathematical derivations and simulation techniques. Important distributions and large sample results. The frequentist paradigm.

2017/2018 Instructor: Mike Marin.

Regression I

Linear models for a quantitative response variable, with multiple categorical and/or quantitative predictors. Matrix formulation of linear regression. Model assessment and prediction.

2017/2018 Instructor: Gabriela Cohen Freue.

Data Science Workflows

Interactive vs. scripted/unattended analyses and how to move fluidly between them. Reproducibility through automation and dynamic, literate documents. The use of version control and file organization to enhance machine- and human-readability.

2017/2018 Instructor: Tiffany Timbers.

Supervised Learning I

Introduction to supervised machine learning, with a focus on classification. k-NN, decision trees, SVMs, and how to combine models via ensembling: boosting, bagging, random forests. Basic machine learning concepts such as generalization error and overfitting.

2017/2018 Instructor: Hyeju Jang.

Databases and Data Retrieval

How to work with data stored in relational database systems or in formats utilizing markup languages. Storage structures and schemas, data relationships, and ways to query and aggregate such data.

2017/2018 Instructor: Laks Lakshmanan.

Regression II

Useful extensions to basic regression, e.g., generalized linear models, mixed effects, smoothing, robust regression, and techniques for dealing with missing data.

2017/2018 Instructor: Lang Wu.

Feature and Model Selection

How to evaluate and select features and models. Cross-validation, ROC curves, feature engineering, the role of regularization. Automating these tasks with hyperparameter optimization.

2017/2018 Instructor: Mark Schmidt.
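
To make the cross-validation idea from the Feature and Model Selection description concrete, the sketch below splits sample indices into k folds so that each fold serves once as the validation set. The function name `k_fold_indices` and its parameters are illustrative assumptions, not part of any course material or library.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Example: 10 samples split into 3 folds.
splits = list(k_fold_indices(10, 3))
```

A model would be fit on each `train` list and scored on the matching `validation` list; averaging the k scores gives the cross-validated estimate used to compare models or hyperparameter settings.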

Unsupervised Learning

How to find groups and other structure in unlabeled, possibly high-dimensional data. Dimension reduction for visualization and data analysis. Clustering, association rules, model fitting via the EM algorithm.

2017/2018 Instructor: TBD.

Collaborative Software Development

How to apply practices from collaborative software development in data science workflows. Appropriate use of abstraction and classes, the software life cycle, unit testing / continuous integration, and packaging for use by others.

2017/2018 Instructor: Meghan Allen.

Privacy, Ethics, and Security

The legal, ethical, and security issues concerning data, including aggregated data. Proactive compliance with rules and, in their absence, principles for the responsible management of sensitive data. Case studies.

2017/2018 Instructor: Ed Knorr.

Supervised Learning II

Stochastic gradient descent. Logistic regression. Neural networks and deep learning: state-of-the-art implementation considerations in both software and hardware (GPUs).

2017/2018 Instructor: Mike Gelbart.

Web and Cloud Computing

How to use the web as a platform for data collection, computation, and publishing. Accessing data via scraping and APIs. Using the cloud for tasks that are beyond the capability of your local computing resources.

2017/2018 Instructor: Mike Feeley.

Statistical Inference and Computation II

Methods for dealing with the multiple testing problem. Bayesian reasoning for data science. How to formulate and implement inference using the prior-to-posterior paradigm.

2017/2018 Instructor: Paul Gustafson.
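
The description above does not say which multiple testing methods the course covers. As one possible illustration, the sketch below implements two widely used corrections, Bonferroni and Benjamini-Hochberg, on a made-up list of p-values. The function names and the example values are assumptions for illustration only.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for i in order[:max_rank]:
        rejected[i] = True
    return rejected


# Made-up p-values from eight hypothetical tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

On these values Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg rejects the two smallest, which reflects the usual trade-off between strict error control and power.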

Advanced Machine Learning

Advanced machine learning methods, with an undercurrent of natural language processing (NLP) applications. Bag of words, recommender systems, topic models, ranking, natural language as sequence data, POS tagging, CRFs for named entity recognition and RNNs for text synthesis. An introduction to popular NLP libraries in Python.

2017/2018 Instructor: Mark Schmidt.

Spatial and Temporal Models

Model fitting and prediction in the presence of correlation due to temporal and/or spatial association. ARIMA models and Gaussian processes.

2017/2018 Instructor: Natalia Nolde.

Experimentation and Causal Inference

Statistical evidence from randomized experiments versus observational studies. Applications of randomization, e.g., A/B testing for website optimization.

2017/2018 Instructor: Paul Gustafson.
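
As a hedged illustration of the A/B testing application mentioned in the Experimentation and Causal Inference description, the sketch below compares conversion rates between two randomized groups using a pooled two-proportion z-test. The function name and the visitor counts are hypothetical, and this test is only one of several ways such an experiment might be analyzed.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 1000 visitors per variant, 100 vs. 130 conversions.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

Because visitors are assigned to variants at random, a small p-value here supports a causal reading of the difference, which is the contrast with observational data that the course description draws.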

Data Visualization II

How to make principled and effective choices with respect to marks, spatial arrangement, and colour. Analysis, design, and implementation of interactive figures. How to provide multiple views, deal with complexity, and make difficult decisions about data reduction.

2017/2018 Instructor: Tamara Munzner.

Capstone Project

A mentored group project based on real data and questions from a partner within or outside the university. Students will formulate questions and design and execute a suitable analysis plan. The group will work collaboratively to produce a project report, presentation, and possibly other products, such as a web application.

2017/2018 Instructors: MDS Staff.