Math 410-02, Data Science: Theory and Applications, Spring 2015





Organizers: Junping Shi (jxshix@wm.edu), Gexin Yu  (gyu@wm.edu)

Time and Location: Wednesday 2:00--2:50pm,   Small Hall 235

Webpage:  http://jxshix.people.wm.edu/Math410-2015/index.html

W&M EXTREEMS-QED program website

Purpose and Goals:  The purpose of this one-credit Math 410 course is to introduce students to big data analysis, data science and possible undergraduate research projects in these topics at William and Mary.  The format will consist mainly of weekly talks by faculty followed by class discussions and/or exercises related to the presented topics.  The typical student in this course will be in his or her sophomore or junior year and will have an interest in pursuing a research project related to computational mathematics.  For many, this course can serve as a gateway to establishing a research project through the EXTREEMS-QED program.

Course Grade:  The course grade will be based on attendance and participation.  Students may miss 1 of the 14 talks without penalty.  Students may earn extra credit for attending EXTREEMS-QED/Math Department colloquia and other appropriate talks listed below.  The organizers will record attendance at classes and colloquium talks. After each lecture, students should write (type) a summary of the talk and turn it in before the next lecture. More specifically, out of a total of 100 points, attendance at each talk is worth 4 points, and participation (which includes asking/answering questions in class) and homework (which includes presentation summaries and assignments given by the speaker) account for 4 points. Attendance at each eligible Math colloquium is worth 2 extra points.

Schedule:





Date
Title
Speaker
Abstract/Reading material/video
Week 1 (1/21)
Big (or small) data and William & Mary
EXTREEMS-QED program

Junping Shi (mathematics)
article: Data Driven: The New Big Science
video: explaining big data (8 min)
Week 2 (1/28)
Ensemble Trees and CLTs: Statistical Inference in Machine Learning
Lucas Mentch (Cornell University)
Abstract:  Machine learning algorithms are typically seen as prediction-only tools, meaning that the interpretability and intuition provided by a traditional statistical modeling approach are sacrificed in order to achieve superior predictions.  In this talk, we argue that this black-box perspective need not always be the case.  After contrasting the traditional statistical and machine learning approaches to data analysis, we demonstrate that predictions from tree-based ensemble learners like bagged trees and random forests, when appropriately structured, can be viewed as extended versions of U-statistics.  Given this framework, we derive central limit theorems (CLTs) for predictions and derive a consistent estimate of variance that may be computed at no additional cost, which allows for formal statistical inference to be carried out in practice.  In particular, we produce confidence intervals to accompany predictions and define formal hypothesis tests for both additivity and feature significance. When a large test set is required, we extend our testing procedures and utilize random projections to accommodate the potential p>>n setting.  These tools are illustrated on data provided by Cornell University's Lab of Ornithology.
Week 3 (2/4)
Transformations on matrices and tensors
Chi-Kwong Li (mathematics)
We describe some results and problems on transformations of matrices and tensors that preserve some special properties. The connections of such problems to the study of operations leaving invariant important features of large data sets will be mentioned.
Week 4 (2/11)
Composite Empirical Likelihood: A Derivation of Multiple Non Parametric Likelihood Objects
Adam Jaeger (University of Georgia)
The likelihood function plays a pivotal role in statistical inference because it is easy to work with and the resultant estimators are known to have good properties. However, these results hinge on correct specification of the likelihood as the true data-generating mechanism.  Many modern problems involve extremely complicated distribution functions, which may be difficult -- if not impossible -- to express explicitly.  This is a serious barrier to the likelihood approach, which requires the specification of a model.  Non-parametric methods are one way to avoid the problem of having to specify a particular data-generating mechanism, but can be computationally intensive, reducing their accessibility for large data problems. We propose a new approach that combines multiple non-parametric likelihood-type objects to build a data-driven approximation of the true function.  We build on two alternative likelihood approaches, empirical and composite likelihood, taking advantage of the strengths of each.  Specifically, from empirical likelihood we borrow the ability to avoid a parametric specification, and from composite likelihood we gain a decrease in computational load.  In this talk, I define the general form of the composite empirical likelihood, derive some of the asymptotic properties of this new class, and explore some applications of the approach.
Week 5 (2/18)
Topological Analysis on Firn Data
Yu-Min Chung (mathematics)
Firn is a type of snow at an intermediate stage between snow and glacial ice.  Most importantly, its structure reveals climate information in certain ways.  We use computational topology, which is a natural way to describe the shape of an object, to explore the firn structure.  The main difficulty is that the data is noisy, but with topological tools, in particular persistent homology, robust features can be extracted from the noisy data.  In this talk, we will give a brief introduction to computational topology and persistent homology, and analyze the firn structure with these topological tools.  This is joint work with Professor Sarah Day and Dr. Kaitlin Keegan at the University of Copenhagen.
Week 6 (2/25)
Conway's Orbifold Notation for the 17 Symmetry Types of Repeating Planar Patterns
Greg Smith (applied sci)
I will present a gentle introduction to the 17 symmetry types for repeating patterns in the plane (a.k.a., the plane crystallographic groups) and Conway's "orbifold notation" that draws on work of William Thurston and Macbeath.  This language is described in the reading (Conway and Huson, 2002) and in a beautifully illustrated book entitled "The Symmetries of Things" by John H. Conway, Heidi Burgiel and Chaim Goodman-Strauss (2008).    
Week 7 (3/4)
Problems in graph theory
Gexin Yu (mathematics)

Week 8 (3/11)
Spring Break (no class)

Week 9 (3/18)
Sensitivity Analysis for Unmeasured Confounding in Linear Structural Equation Models
Adam Sullivan (Harvard University)
We consider the biases that may arise when an unmeasured confounder is omitted from a structural equation model (SEM) and sensitivity analysis techniques to correct for such biases. We give an analysis of which effects in an SEM are and are not biased by an unmeasured confounder. It is shown that a single unmeasured confounder will bias not just one but numerous effects in an SEM. We present sensitivity analysis techniques to correct for biases in total, direct, and indirect effects when using SEM analyses, and illustrate these techniques with a study of aging and cognitive function.
Week 10 (3/25)
Data science in action

churn.R churn_fix.csv
Ji Li (Yesware Inc.)

The talk will focus on illustrating what day-to-day data science work looks like - a combination of ETL, data munging, data exploration, machine learning, and delivery of results. Concrete examples will be given through an in-class demo. No prior knowledge is required to follow along.

Week 11 (4/1)
Some results on preservers on tensor products of Hermitian matrices
Ajda Fosner (University of Primorska, Slovenia)
The study of linear maps on matrices, operators or other algebraic objects that leave invariant certain functions, subsets or relations is now commonly referred to as the study of linear preserver problems. In the last few decades a lot of results on linear preservers on matrix algebras as well as on more general rings and operator algebras have been obtained. Recently, many mathematicians have raised questions combining linear preserver problems with quantum information science. In this seminar, we will present some recent results on linear preserver problems on tensor products of matrices. In particular, we will talk about spectrum and spectral radius preservers, rank-one preservers, and determinant preservers on tensor products of Hermitian matrices.
Week 12 (4/8)
Space, time, and AidData: How to make information actionable (and, why it's really quite hard)
Dan Runfola (AidData)
AidData tracks trillions of dollars in funding for development. Now anyone can assess who is funding what, where and to what effect. Donors and governments can maximize the impact of their investments. Citizens can hold their leaders to account for results. And, we can give statistics a run for its money.  In this talk, Dr. Runfola will be discussing the wide variety of ways that AidData's geographically-referenced information is being used to understand international aid targeting and effectiveness, and the many challenges still remaining in doing so effectively. In particular, we will cover issues of spatial uncertainty, the quantification of spillover effects in traditional causal identification modeling approaches, the integration of human perception into quantified datasets, and computational challenges in broad-scope satellite imagery processing.
Week 13 (4/15)
The effect of local transposon density on homeolog specific expression differences in allopolyploid monkey-flower
Ron Smith (applied sci)
Allopolyploids form by hybridization of two species coupled with whole genome duplication. Young allopolyploid lineages face the unique challenge of organizing two genomes contributed by different parent species that evolved in separate contexts.  In the wake of allopolyploidy, homeologous genes (homologous genes derived from different parental subgenomes) are often expressed at non-equal levels. Typically, genome-wide patterns of homeolog expression bias are highly skewed: one parental genome (subgenome) is expressed at higher levels than the other.  Here we use allopolyploid species in the Mimulus genus as models to understand how genome structure, specifically transposon class and local transposon density, explains homeolog specific expression differences.  A mechanistic understanding of these phenomena is fundamental to understanding allopolyploid gene expression and evolution, subgenome fractionation and genomic dynamics subsequent to whole genome duplication, an important factor in the evolutionary histories of plants, fungi and vertebrates.  [Joint work with Gregory D. Smith and Joshua Puzey.]

Background reading: 

Nina V. Fedoroff.  Presidential Address: Transposable elements, epigenetics, and genome evolution.  Science.  338(6108):758-67, 2012.

http://www.sciencemag.org/content/338/6108/758.long
Week 14 (4/22)

Margaret Saha (biology)

Week 15 (4/29)
Mathematical modeling of restoration of Chesapeake Bay oysters
Leah Shaw (applied science)

Colloquium Talks related to big data analysis and suitable for undergraduate students: (normally Friday 2-2:50pm, Jones Hall 301)

1.  Mathematics Colloquium and EXTREEMS-QED Lecture: Kevin McGoff (Duke University), Friday, January 23rd 2015, 2pm - 3pm. Title: Statistical inference for dynamical systems. https://events.wm.edu/event/view/mathematics/49700

2. Mathematics Colloquium and EXTREEMS-QED Lecture: Irem Sengul (North Carolina State University), Friday, January 30th 2015, 2pm - 3pm. Title: Modeling for the Equitable and Effective Food Distribution under Stochastic Capacities. https://events.wm.edu/event/view/mathematics/49702

3. Mathematics Colloquium and EXTREEMS-QED Lecture: Guodong (Gordon) Pang (Penn State University), Friday, February 6th 2015, 2pm - 3pm. Title: Large-Scale Fork-Join Networks with Non-Exchangeable Synchronization. https://events.wm.edu/event/view/mathematics/49704

4. Mathematics Colloquium and EXTREEMS-QED Lecture: Anh Ninh (Rutgers University), Friday, February 13th 2015, 2pm - 3pm. Title: Recruitment stocking problem. https://events.wm.edu/event/view/mathematics/50884

5. Mathematics Colloquium and EXTREEMS-QED Lecture: Guannan Wang (University of Georgia), Friday, March 20th 2015, 3pm - 4pm. https://events.wm.edu/event/view/mathematics/52160




Adapted from this page,  modified by Junping Shi, Spring 2015.