Seminar on Computational Learning and Adaptation


  Statistical Modeling of Large-Scale Scientific Simulation Data

Tina Eliassi-Rad
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
http://www.llnl.gov/CASC/people/eliassi-rad/

Simulations of complex scientific phenomena involve the execution of large-scale computer programs. These simulations generate massive data sets over the spatio-temporal space. Unfortunately, the sheer size of these data sets has made it impossible to explore them efficiently. Therefore, constructing "queryable" models for such massive data sets is an essential step in helping scientists discover new information from their computer simulations.

At the Center for Applied Scientific Computing (CASC), we have developed an ad hoc query infrastructure, which reduces data storage requirements and query access times by (1) creating and storing statistical and mathematical models of the data at multiple resolutions, and (2) evaluating queries on the models of the data instead of on the entire data set. In this talk, I focus on the modeling side of our infrastructure. In particular, I present three simple but effective statistical modeling techniques for simulation data. The first modeling technique computes the "true" (unbiased) mean of systematic partitions of the data. It makes no assumptions about the distribution of the data and uses a variant of the root mean square error to evaluate a model. The second statistical modeling technique applies the Anderson-Darling goodness-of-fit method to systematic partitions of the data. This method evaluates a model by how well the data it covers pass the normality test. Finally, the third statistical modeling technique is a multivariate clustering algorithm, which uses the cosine similarity measure to cluster the field variables in a data set. To scale our multivariate clustering algorithm to large data sets, we take advantage of the geometrical properties of the cosine similarity measure. This allows us to reduce the modeling time from O(n^2) to O(f(u)*n), where n is the number of data points and f(u) is a function of the user-defined clustering threshold. We show that on average f(u) is less than n. Our experimental evaluations on several scientific data sets illustrate the value of using these statistical modeling techniques on multiple resolutions of large-scale simulation data sets.
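
As a rough illustration of how a similarity threshold can bound the cost of clustering, the sketch below implements a generic single-pass "leader" clustering with cosine similarity. It is not the algorithm presented in the talk. The names `cosine_similarity` and `threshold_cluster` and the sample values are hypothetical, chosen only for this example. Each data point is compared against the current cluster representatives only, so the work is roughly (number of clusters) * n rather than n * n, and the number of clusters depends on the chosen threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def threshold_cluster(points, threshold):
    """Single-pass leader clustering (illustrative assumption, not the talk's method).

    A point joins the most similar existing leader if that similarity is
    at least `threshold`; otherwise the point becomes a new leader.
    Returns (labels, leaders), where labels[i] is the cluster index of points[i].
    """
    leaders = []
    labels = []
    for p in points:
        best, best_sim = -1, threshold
        for i, leader in enumerate(leaders):
            sim = cosine_similarity(p, leader)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            leaders.append(p)
            best = len(leaders) - 1
        labels.append(best)
    return labels, leaders


# Hypothetical field-variable vectors (e.g., two variables per mesh point).
points = [(1.0, 0.0), (2.0, 0.1), (0.0, 1.0), (0.1, 3.0), (1.0, 1.0)]
labels, leaders = threshold_cluster(points, threshold=0.95)
print(labels)        # cluster index for each point
print(len(leaders))  # number of clusters found at this threshold
```

A higher threshold generally yields more, tighter clusters and therefore more comparisons per point; a lower threshold yields fewer clusters and less work.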


Date: Thursday, March 6

Time: 4:15-5:30PM

Place: Cordura 100

