Math 410-07, Big Data Analysis, Spring 2014





Organizers: Junping Shi (jxshix@wm.edu), Tanujit Dey (tdey@wm.edu)

Time and Location: Wednesday 2:00--2:50pm, Small Hall 233

Webpage:  http://jxshix.people.wm.edu/Math410-2014/index.html

W&M EXTREEMS-QED program website

Purpose and Goals:  The purpose of this one-credit Math 410 course is to introduce students to big data analysis, data science, and possible undergraduate research projects on these topics at William and Mary.  The format will consist mainly of weekly talks by faculty, followed by class discussions and/or exercises related to the presented topics.  The typical student in this course will be in his or her sophomore or junior year and will have an interest in pursuing a research project related to computational mathematics.  For many, this course can serve as a gateway to establishing a research project through the EXTREEMS-QED program.

Course Grade:  The course grade will be based on attendance and participation.  Students may miss 1 of the 12 talks without penalty.  Students may earn extra credit for attending EXTREEMS-QED/Math Department colloquia and other appropriate talks listed below.  Attendance at classes and colloquium talks will be recorded by the organizers. After each lecture, students should write (type) a summary of the talk and turn it in before the next lecture. More specifically, out of a total of 100 points, attendance at each talk is worth 4 points, and participation (which includes asking/answering questions in class and on the discussion board) and homework (which includes presentation summaries and assignments given by the speaker) account for 5 points. Attendance at each eligible Math colloquium is worth 2 extra points.

Schedule:





Date
Title
Speaker
Abstract/Reading material/video
Week 2 (1/22)
What is big data?
Junping Shi (mathematics)
article: Data Driven: The New Big Science
video: explaining big data (8 min)
Week 3 (1/29)

cancelled due to snow

Week 4 (2/5)
Data from a Stone: Aid Transparency, PDF Ghettos, and Data Mining
Albert Decatur (AidData)
Records of international aid are elusive and sparse, but data about international aid are sorely needed. While individual aid projects can be well understood, donors and recipients do not have a clear understanding of where the sum of aid money goes or what it is used for.  Even if a donor or recipient knows exactly what they're doing, they may choose to hide their activities from the international community because their work could be seen as selfish or unsavory.  We are certain that this lack of open knowledge about aid leads to sub-optimal allocation affecting hundreds of millions of people for the worse.

AidData is one organization among many working to increase aid transparency through open data.  
To find out more about an aid project we look at the documents that have already been produced.  So far we've worked with human coders, who we trust.  But aid documents are probably being written far faster than our coders can read them, and we have yet to get through an enormous backlog of documents.  We'd like to partner with skilled and creative mathematicians and computer scientists to mine our documents for data.
Are you ready to contribute?
Week 5 (2/12)
Using a Computer Algebra System in Data Analysis
Larry Leemis (mathematics)

Week 6 (2/19)
Adaptive Social Networks
Leah Shaw (applied science)

Week 7 (2/26)
Decomposition of quantum gates
Chi-Kwong Li (mathematics)
In quantum computing, quantum operations are applied to quantum states to process information. Mathematically, quantum states are represented by complex vectors, and quantum operations are represented by unitary transforms. It is important to derive efficient schemes to implement unitary transforms because of the hardware constraints and the very high dimensional space under consideration. In this talk, we will describe some of my recent work with undergraduate and graduate students, and future research directions in this line of study.
Week 8 (3/5)
Spring Break (no class)

Week 9 (3/12)
Saving Infants in a Heartbeat!
John Delos (physics)

Week 9 (3/19)
"Big Data" from RNA-Seq Experiments
Margaret Saha (biology)

The introduction of “next-generation” sequencing technology has allowed biologists to obtain unprecedented amounts of sequence data in short periods of time. In particular, this technology has revolutionized our ability to analyze gene expression on a global level through RNA-Seq, a method that converts the RNA in a given sample into cDNA, which is then sequenced.  RNA-Seq is quickly becoming an essential tool for every field of biology, from biomedicine and drug development to evolution and ecology.  A typical RNA-Seq experiment can produce 30-100 million bases in less than three hours.  However, the sheer amount of data and the unanticipated complexity of the transcriptome (the complete collection of transcribed RNA) have made data analysis extremely challenging and have necessitated the development of novel statistical and computational tools.  In this lecture we will review the nature of RNA-Seq data and discuss the major challenges (and opportunities!) presented by these data sets.
Computational methods for transcriptome annotation and quantification using RNA-seq, by Garber et al., Nature Methods, 2011

Week 10 (3/26)
Visual and Virtual Data: Using Simulation to Manage Your Expectations
Greg Smith (applied science)

Week 11 (4/2)
Data Mining at NASA Langley Research Center
Nipa Phojanamongkolkij (NASA), Ersin Ancel (NASA)
The NASA Aviation Safety program needs data mining, especially text mining, of the narrative sections of accident/incident reports from the NTSB (National Transportation Safety Board). There are so many incident reports that it would take too long for individuals to read all of the narratives and find key phrases that point to incident precursors.
Week 12 (4/9)
Exploratory Methods for the Integrated Analysis of Multi-Source Data

Eric Lock (Duke University)
Research in molecular biology and other fields often requires the analysis of datasets in which multiple related sources of data are available for a common sample set.  We describe two exploratory methods for the integrated analysis of such datasets:  Joint and Individual Variation Explained (JIVE) and Bayesian Consensus Clustering (BCC).  JIVE gives a general decomposition of variation consisting of three terms: a low-rank approximation capturing joint variation across data sources, low-rank approximations capturing structured variation individual to each data source, and residual noise. JIVE quantifies the amount of joint variation between data sources, reduces the dimensionality of the data in an insightful way, and allows for the visual exploration of joint and individual structure.  BCC is a tool to cluster a set of objects based on multi-source data. The Bayesian model permits a separate clustering of the objects for each data source that adheres loosely to an overall clustering.  We illustrate the above methods with applications to publicly available data from The Cancer Genome Atlas. This is joint work with collaborators at The University of North Carolina and Duke University.
Week 13 (4/16)
Computing phylogenetic trees (quickly)
Anke van Zuylen (mathematics) and Jamie Bieron
I will be talking about a research project that I have been working on with two students, and one of them (Jamie Bieron) will explain his contribution (a new algorithm for the problem we are considering). The plan is that I will deliver about 40 minutes of the lecture, and he will take about 10 minutes.
Week 14 (4/23)
Using Big Data for Marine Species to Achieve Conservation Goals
Rom Lipcius (marine science)

Colloquium talks related to big data analysis and suitable for undergraduate students (normally Friday 2-2:50pm, Jones Hall 301):

1. March 21, 2-3pm (location: TBA), Mathematics Colloquium: Donald E. Brown, University of Virginia

2. April 4, 2-3pm, Jones Hall 301, Mathematics Colloquium: Ana Moura, University of Aveiro, Portugal. Title:  A mixed integer programming model to solve the short sea shipping distribution problem




Adapted from this page,  modified by Junping Shi, Spring 2014.