Purpose and Goals: The purpose of this one-credit
Math 410 course is to introduce students to big
data analysis, data science and possible undergraduate
research projects in these topics at William and
Mary. The format will consist mainly of weekly
talks by faculty followed by class discussions and/or
exercises related to the presented topics. The
typical student in this course will be in his or her
sophomore or junior year and will have an interest in
pursuing a research project related to computational
mathematics. For many, this course can serve as
a gateway to establishing a research project through
the EXTREEMS-QED
program.
Date |
Title |
Speaker |
Abstract/Reading
material/video |
Week 1 (1/21) |
Big (or small) data and
William & Mary EXTREEMS-QED program |
Junping Shi
(mathematics) |
Article: "Data Driven: The New Big Science". Video: "Explaining big data" (8 min) |
Week 2 (1/28) |
Ensemble Trees and
CLTs: Statistical Inference in Machine Learning |
Lucas
Mentch (Cornell University) |
Abstract: Machine
learning algorithms are typically seen as
prediction-only tools, meaning that the interpretability
and intuition provided by a traditional statistical
modeling approach are sacrificed in order to achieve
superior predictions. In this talk, we argue that
this black-box perspective need not always be the
case. After contrasting the traditional
statistical and machine learning approaches to data
analysis, we demonstrate that predictions from
tree-based ensemble learners like bagged trees and
random forests, when appropriately structured, can be
viewed as extended versions of U-statistics. Given
this framework, we derive central limit theorems (CLTs)
for predictions and derive a consistent estimate of
variance that may be computed at no additional cost,
which allows for formal statistical inference to be
carried out in practice. In particular, we produce
confidence intervals to accompany predictions and define
formal hypothesis tests for both additivity and feature
significance. When a large test set is required, we
extend our testing procedures and utilize random
projections to accommodate the potential p>>n
setting. These tools are illustrated on data
provided by Cornell University's Lab of Ornithology. |
Week 3 (2/4) |
Transformations
on matrices and tensors |
Chi-Kwong Li
(mathematics) |
Transformations on matrices and tensors that leave invariant important features of large data sets will be mentioned. |
Week 4 (2/11) |
Composite Empirical Likelihood: A
Derivation of Multiple Non Parametric Likelihood
Objects |
Adam
Jaeger (University of Georgia) |
The likelihood function plays a pivotal
role in statistical inference because it is easy to work
with and the resultant estimators are known to have good
properties. However, these results hinge on correct
specification of the likelihood as the true
data-generating mechanism. Many modern problems
involve extremely complicated distribution functions,
which may be difficult -- if not impossible -- to
express explicitly. This is a serious barrier to
the likelihood approach, which requires the
specification of a model. Non-parametric methods
are one way to avoid the problem of having to specify a
particular data-generating mechanism, but they can be
computationally intensive, which reduces their accessibility
for large data problems. We propose a new approach that
combines multiple non-parametric likelihood-type objects
to build a data-driven approximation of the true
function. We build on two alternative likelihood
approaches, empirical and composite likelihood, taking
advantage of the strengths of each. Specifically,
from empirical likelihood we borrow the ability to avoid
a parametric specification, and from composite
likelihood we gain a decrease in computational
load. In this talk, I define the general form of
the composite empirical likelihood, derive some of the
asymptotic properties of this new class, and explore
some applications of the approach. |
Week 5 (2/18) |
Topological Analysis on Firn Data | Yu-Min Chung (mathematics) |
Firn is a type of snow at an
intermediate stage between snow and glacial
ice. Most importantly, its structure reveals
climate information in certain ways. We use
computational topology, which is a natural way to
describe the shape of an object, to explore the firn
structure. The main difficulty is that the data
are noisy, but topological tools, in particular
persistent homology, can extract robust features
from the noisy data. In this talk, we
will give a brief introduction to computational
topology and persistent homology, and analyze the
firn structure with these topological tools. This
is joint work with Professor Sarah Day and
Doctor Kaitlin Keegan at the University of
Copenhagen. |
Week 6 (2/25) |
Conway's Orbifold
Notation for the 17 Symmetry Types of Repeating Planar
Patterns |
Greg
Smith (applied sci) |
I will present a gentle introduction to the 17 symmetry types for repeating patterns in the plane (a.k.a., the plane crystallographic groups) and Conway's "orbifold notation" that draws on work of William Thurston and Macbeath. This language is described in the reading (Conway and Huson, 2002) and in a beautifully illustrated book entitled "The Symmetries of Things" by John H. Conway, Heidi Burgiel and Chaim Goodman-Strauss (2008). |
Week 7 (3/4) |
Problems in graph
theory |
Gexin Yu (mathematics) |
|
Week 8 (3/11) |
Spring Break (no class) | ||
Week 9 (3/18) |
Sensitivity Analysis for Unmeasured Confounding in Linear Structural Equation Models | Adam Sullivan (Harvard University) | We consider the biases that may arise when an unmeasured confounder is omitted from a structural equation model (SEM) and sensitivity analysis techniques to correct for such biases. We give an analysis of which effects in an SEM are and are not biased by an unmeasured confounder. It is shown that a single unmeasured confounder will bias not just one but numerous effects in an SEM. We present sensitivity analysis techniques to correct for biases in total, direct, and indirect effects when using SEM analyses, and illustrate these techniques with a study of aging and cognitive function. |
Week 10 (3/25) |
Data science in action (files: churn.R, churn_fix.csv) |
Ji Li (Yesware Inc.) |
The talk will illustrate what day-to-day data science work looks like: a combination of ETL, data munging, data exploration, machine learning, and delivery of results. Concrete examples will be given through an in-class demo. No prior knowledge is required to follow along. |
Week 11 (4/1) |
Some results on
preservers on tensor products of Hermitian matrices |
Ajda Fosner (University of Primorska, Slovenia) |
The study of linear maps
on matrices, operators or other algebraic objects that
leave invariant certain functions, subsets or relations is now commonly referred to as the study of linear preserver problems. In the last few decades a lot of results on linear preservers on matrix algebras as well as on more general rings and operator algebras have been obtained. Recently, many mathematicians have raised questions combining linear preserver problems with quantum information science. In this seminar, we will present some recent results on linear preserver problems on tensor products of matrices. In particular, we will talk about spectrum and spectral radius preservers, rank-one preservers, and determinant preservers on tensor products of Hermitian matrices. |
Week 12 (4/8) |
Space, time, and AidData: How to make information actionable (and, why it's really quite hard) | Dan Runfola (AidData) |
AidData tracks trillions of dollars in funding for development. Now anyone can assess who is funding what, where and to what effect. Donors and governments can maximize the impact of their investments. Citizens can hold their leaders to account for results. And, we can give statistics a run for its money. In this talk, Dr. Runfola will discuss the wide variety of ways that AidData's geographically-referenced information is being used to understand international aid targeting and effectiveness, and the many challenges that remain in doing so effectively. In particular, we will cover issues of spatial uncertainty, the quantification of spillover effects in traditional causal identification modeling approaches, the integration of human perception into quantified datasets, and computational challenges in broad-scope satellite imagery processing. |
Week 13 (4/15) |
The effect of local transposon density on
homeolog specific expression differences in
allopolyploid monkey-flower
|
Ron Smith (applied sci) |
Allopolyploids form by
hybridization of two species coupled with whole genome
duplication. Young allopolyploid lineages face the
unique challenge of organizing two genomes contributed
by different parent species that evolved
in separate contexts. In the wake of
allopolyploidy, homeologous genes (homologous genes
derived from different parental subgenomes) are
often expressed at non-equal levels. Typically,
genome-wide patterns of homeolog expression bias
are highly skewed: one parental genome (subgenome) is
expressed at higher levels than the
other. Here we use allopolyploid species in
the Mimulus genus as models to understand how genome
structure, specifically transposon class and local
transposon density, explains homeolog-specific expression
differences. A mechanistic understanding of
these phenomena is fundamental to understanding
allopolyploid gene expression and
evolution, subgenome fractionation and genomic
dynamics subsequent to whole genome duplication, an
important factor in the evolutionary histories of
plants, fungi and vertebrates. [Joint work
with Gregory D. Smith and Joshua Puzey.] Background reading: Nina V. Fedoroff. Presidential Address: Transposable elements, epigenetics, and genome evolution. Science. 338(6108):758-67, 2012. http://www.sciencemag.org/ |
Week 14 (4/22) |
|
Margaret
Saha (biology) |
|
Week 15 (4/29) |
Mathematical modeling of restoration of Chesapeake Bay oysters | Leah Shaw (applied
science) |
Colloquium talks related to big data analysis and suitable for
undergraduate students (normally Fridays, 2-2:50pm,
Jones Hall 301):
2. Mathematics Colloquium and EXTREEMS-QED Lecture: Irem
Sengul (North Carolina State University), Friday, January 30th
2015, 2pm - 3pm. Title: Modeling for the Equitable and
Effective Food Distribution under Stochastic Capacities. https://events.wm.edu/event/view/mathematics/49702
3. Mathematics Colloquium and EXTREEMS-QED Lecture: Guodong
(Gordon) Pang (Penn State University), Friday, February 6th
2015, 2pm - 3pm. Title: Large-Scale Fork-Join Networks with
Non-Exchangeable Synchronization. https://events.wm.edu/event/view/mathematics/49704
4. Mathematics Colloquium and EXTREEMS-QED Lecture: Anh Ninh
(Rutgers University), Friday, February 13th 2015, 2pm - 3pm.
Title: Recruitment stocking problem. https://events.wm.edu/event/view/mathematics/50884
5. Mathematics Colloquium and EXTREEMS-QED Lecture: Guannan Wang (University of Georgia), Friday, March 20th 2015, 3pm - 4pm. https://events.wm.edu/event/view/mathematics/52160