Statistical Modeling of Large-Scale Scientific Simulation Data
Tina Eliassi-Rad
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
http://www.llnl.gov/CASC/people/eliassi-rad/
Simulations of complex scientific phenomena involve the execution of large-scale computer programs. These simulations generate massive data sets over the spatio-temporal space. Unfortunately, the sheer size of these data sets has made exploring them efficiently impossible. Therefore, constructing "queriable" models for such massive data sets is an essential step in helping scientists discover new information from their computer simulations.
At the Center for Applied Scientific Computing (CASC), we have developed
an ad-hoc query infrastructure, which reduces the data storage
requirements and query access times by (1) creating and storing
statistical and mathematical models of the data at multiple resolutions,
and (2) evaluating queries on the models of the data instead of on the
entire data set. In this talk, I focus on the modeling side of our infrastructure.
In particular, I present three simple but effective statistical modeling
techniques for simulation data. The first modeling technique computes the
"true" (unbiased) mean of systematic partitions of the data. It makes no
assumptions about the distribution of the data and uses a variant of the
root mean square error to evaluate a model. The second statistical
modeling technique uses the Anderson-Darling goodness-of-fit method on
systematic partitions of the data. This method evaluates a model by how
well it passes the normality test on the data. Finally, the third
statistical modeling technique is a multivariate clustering algorithm,
which utilizes the cosine similarity measure to cluster the field
variables in a data set. To scale our multivariate clustering algorithm
for large-scale data sets, we take advantage of the geometrical properties
of the cosine similarity measure. This allows us to reduce the modeling
time from O(n*n) to O(f(u)*n), where n is the number of data points and
f(u) is a function of the user-defined clustering threshold. We show that
on average f(u) is less than n. Our experimental evaluations on several
scientific data sets illustrate the value of using these statistical
modeling techniques on multiple resolutions of large-scale simulation data
sets.
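
As a rough illustration of why a threshold-dependent factor can replace n in the clustering cost, the sketch below uses a simple "leader" clustering scheme with cosine similarity. This is not the algorithm presented in the talk, and it does not use the geometrical properties mentioned above; the names cosine_similarity and leader_cluster, the sample points, and the threshold value are all assumptions made for this example. Each point is compared only against the current cluster leaders, so the number of comparisons is bounded by (number of clusters) * n rather than n * n, and the number of clusters depends on the user-defined threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention assumed here for zero vectors
    return dot / (norm_a * norm_b)


def leader_cluster(points, threshold):
    """Assign each point to the first leader whose cosine similarity
    with the point is at least `threshold`; otherwise the point
    becomes the leader of a new cluster.

    Returns (labels, leaders, comparisons). This is a hypothetical
    stand-in, not the clustering algorithm described in the abstract.
    """
    leaders = []      # one representative vector per cluster
    labels = []       # cluster index for each input point
    comparisons = 0   # number of similarity evaluations performed
    for p in points:
        assigned = None
        for idx, leader in enumerate(leaders):
            comparisons += 1
            if cosine_similarity(p, leader) >= threshold:
                assigned = idx
                break
        if assigned is None:
            leaders.append(p)
            assigned = len(leaders) - 1
        labels.append(assigned)
    return labels, leaders, comparisons


# Made-up field-variable vectors, for illustration only.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.05)]
labels, leaders, comparisons = leader_cluster(points, threshold=0.95)
```

In this sketch, a lower threshold tends to merge more points into fewer clusters, which reduces the number of leaders each new point is compared against; a threshold close to 1 tends to produce more clusters and more comparisons.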
Date: Thursday, March 6
Time: 4:15-5:30PM
Place: Cordura 100