Raymond J. Carroll Department of Statistics Member, Center for Statistical Bioinformatics Director, Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll The Interface of Functional and Longitudinal Data


  • Slide 1

Raymond J. Carroll Department of Statistics Member, Center for Statistical Bioinformatics Director, Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll The Interface of Functional and Longitudinal Data

  • Slide 2

My Charge
"Please feel free to talk about anything you wish" (Dangerous). "Your thinking about longitudinal data and perhaps functional data from a wider perspective." Goals of the workshop are to inspire new researchers, and to take stock of where the interface of longitudinal-functional data and dynamics is headed.

  • Slide 3

What I Want to Talk About
Mother and joey, Tidbinbilla (outside Canberra), September 2010.

  • Slide 4

What I Want to Talk About
Namadgi National Park, July 2005.

  • Slide 5

What I Will Talk About
I will talk about some of the problems I have worked on. No technical solutions; the other speakers look to be providing them. Investigators think marginally, statisticians think of random effects.

  • Slide 6

Some Observations
In my work, there is a tension between providing answers to my collaborators that they can understand; developing new general methodology that is publishable in statistics and can solve more general problems; and thinking about parts of the actual problem that my collaborators would not have thought about. It's easy to get caught up in either of the first two.

  • Slide 7

Some Observations
When I am simply providing answers to stated questions, I find themes similar to the distinction between marginal models such as GEE and nonlinear mixed effects models for longitudinal data. GEE is simply easier. Most scientists think marginally because they are uncomfortable with the idea of variability.

  • Slide 8

What I Will Talk About
Think what the typical smart biologist knows about statistics.
t-tests, ANOVA, simple linear regression. All the focus is on the mean, none on the variability.

  • Slide 9

Some Observations
What we have to do is to deliver the analysis the data collectors can understand, and teach them about variability. Pictures work wonders: functions are no harder to understand than histograms, and understanding variability can help investigators tell stories.

  • Slide 10

Some Observations
We need to advance the field of statistics. Deeper understanding of the underlying process, through random effects modeling, often helps inform future studies and helps investigators tell their story.

  • Slide 11

An Old Colon Carcinogenesis Project
Experiment with 2 lipids (fish oil and corn oil) with and without butyrate (a fatty acid) supplementation, with p27 or MGMT repair measured as the response. Longitudinal, maybe even dynamic, hierarchical and functional. Hierarchical because each of the treatment groups has multiple samples, and each of them has multiple functions. Functional because of the biology.

  • Slide 12

Colon Cancer Data
Jeff Morris, Ciprian Crainiceanu, Ana-Maria Staicu, Naisyin Wang, Veera B, Yehua Li

  • Slide 13

Functional
The colonic crypts have cells: near the bottom (x=0) are the stem cells, near the top (x=1) are the differentiated cells.

  • Slide 14

MGMT Repair Enzyme, 1 crypt
MGMT curve in one crypt. Original analysis found large diet effects.

  • Slide 15

MGMT Repair Enzyme, 1 crypt
The large diet effects on the MGMT repair enzyme are real. There are also large diet effects on apoptosis.

  • Slide 16

MGMT Repair Enzyme, 1 crypt
What do biologists do (define original analysis)? They simplify the data so that they can do ANOVA, duh! They average all the responses (p27 or MGMT, about 200 observations in each analysis) in the bottom 1/3, middle 1/3 and top 1/3. Then they run 3 ANOVAs.

  • Slide 17

MGMT Repair Enzyme, 1 crypt
They then tell a story about all the ANOVAs they have done.
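A minimal sketch of the reduction just described, assuming (hypothetically) that each crypt is stored as paired lists of relative cell positions x in [0, 1] and responses. The function names and data layout are illustrative and are not taken from the original analysis; the F statistic is the textbook one-way ANOVA formula.

```python
from statistics import mean


def thirds_means(positions, responses):
    """Average the responses in the bottom, middle and top thirds of a crypt.

    positions: relative cell positions x in [0, 1] (0 = bottom, 1 = top).
    responses: the measured response (e.g. p27 or MGMT) at each position.
    Raises statistics.StatisticsError if any third contains no cells.
    """
    bins = [[], [], []]
    for x, y in zip(positions, responses):
        idx = min(int(x * 3), 2)  # x = 1.0 is assigned to the top third
        bins[idx].append(y)
    return [mean(b) for b in bins]


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(v for g in groups for v in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In this sketch one would call `thirds_means` once per crypt, collect (say) the bottom-third averages by diet group, and pass those groups to `one_way_anova_f`, repeating for the middle and top thirds. Everything about the shape of the curve within a third, and all the between-crypt variability structure, is discarded along the way.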
We all smile about this, but my collaborator (Joanne Lupton) just got elected into the U.S. National Academy of Science.

  • Slide 18

MGMT Repair Enzyme, 1 crypt
I like to think that our more nuanced analyses help her tell her stories, which is hopefully not wishful thinking!

  • Slide 19

MGMT Repair Enzyme, 1 crypt
Wavelet functional coefficients for apoptotic index in the top 1/3 of the crypt, for fish oil and for corn oil. From Morris and Carroll (2006): fish-oil-fed animals who had a large amount of apoptosis near their lumenal surface also had high levels of the DNA repair enzyme MGMT near their lumenal surface, meaning that the two major mechanisms for dealing with DNA damage were correlated. This relationship was not so strong for corn-oil-fed animals.

  • Slide 20

MGMT Repair Enzyme, the story
We did a full-blown wavelet-based functional mixed model analysis to get these conclusions. Could it have been done marginally? Probably yes, but then that's dull. However, we (a) know much more about the pattern of variability and (b) built up methods and software that can be used in a wide variety of settings.

  • Slide 21

Longitudinal
Colon carcinogenesis is a localized phenomenon. The crypts closest to one another are highly correlated.

  • Slide 22

Colon Cancer Data
The locality hypothesis says that colon cancer starts because of highly localized damage. Longitudinal and hierarchical FDA can tell us many things about this hypothesis, e.g., where is localized damage more likely to occur? While most research focuses on the proximal and distal portions of the colon, FDA reveals that there is as much or more in the middle.

  • Slide 23

Basic Model for p27

  • Slide 24

Colon Cancer Data
Lots of fun fitting this longitudinal, hierarchical functional data set. What did the investigators want to know? They were interested in how correlated neighboring crypts are, consistent with the locality hypothesis.
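One crude, marginal way to look at the neighboring-crypt question can be sketched as follows. It assumes that each crypt has already been reduced to a single summary (for example its mean p27 response) and that crypts are listed in order along the colon at roughly equal spacing. This is not the hierarchical functional analysis used in the project, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def lag_correlation(crypt_summaries, lag):
    """Correlation between summaries of crypts that are `lag` positions apart.

    crypt_summaries: one number per crypt, ordered along the colon (assumed).
    """
    if lag < 1 or lag >= len(crypt_summaries) - 1:
        raise ValueError("lag must leave at least two pairs of crypts")
    return pearson(crypt_summaries[:-lag], crypt_summaries[lag:])
```

Under the locality hypothesis one would expect `lag_correlation` to be high at lag 1 and to decay as the lag grows. A summary like this throws away the within-crypt curves, which is exactly the information the functional model keeps.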
  • Slide 25

Colon Cancer Data
The Bayesian analysis gives them strong point-wise evidence (can supplement with FDR). Allows summary measures.

  • Slide 26

Colon Cancer Data
Acknowledging the longitudinal nature led to much more precise inferences. This is the interaction function between diet and treatment: guess which one allows for locality?

  • Slide 27

Cell Signaling Data
Myometrial cells meant to mimic what goes on near birth were either exposed to dioxin (TCDD) or not exposed. They were then exposed to a hormone, oxytocin, that stimulates calcium ion (Ca2+) signaling. The Ca2+ signal was observed at many pixels of each cell for 512 time points (85 minutes).

  • Slide 28

Cell Signaling Data
Josue Martinez, Jianhua Huang

  • Slide 29

Cell Signaling Data
The cells were segmented, and the intensity of the signal was obtained for each pixel, each cell and all time points. Roughly 25 cells in each treatment group (control and TCDD). Hierarchical because of pixels within cells within treatments.

  • Slide 30

Cell Signaling Data
Functional because pixels are measured over time. Possibly spatial at different levels, because the cells are in spatial alignment. Lots of preprocessing: cell segmentation, adjustment for saturation, and more.

  • Slide 31

Cell Signaling Data
First two minutes of the experiment for the TCDD treated plate.
Next come two movies of the data.

  • Slide 32

Cell Signaling Data
All cells (Control and TCDD), at a basal state in which the cells were cultured, 0-4 minutes and 40-80 minutes after oxytocin exposure.

  • Slide 33

Cell Signaling Data
All cells (Control and TCDD), at a low estrogen state, just before pregnancy (note the delayed response due to TCDD).

  • Slide 34

Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy.

  • Slide 35

Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration.

  • Slide 36

Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration. Areas under the curve (p < 0.001).

  • Slide 37

Cell Signaling Data
You should see that in this analysis, we have not made use of the structure of the data. We have thought like GEE people, and indeed reduced the comparison of control and TCDD to single numbers, e.g., peak time and area under the curve. We did lots of dimension reduction (4 weighted SVD) to get here.

  • Slide 38

Cell Signaling Data
There was a lot of work to get the data into a format for analysis. Question: what can hierarchical, possibly spatial FDA do for us here, and given the structure, how should an analysis proceed? I feel that there is a lot more that we can learn about the process by thinking more deeply about the modeling.

  • Slide 39

Bat Chirp Data
Bats of the same species, residing in Austin (city bats) and College Station (Aggie bats).

  • Slide 40

Bat Chirp Data
Josue Martinez, Jeff Morris

  • Slide 41

Bat Chirp Data

  • Slide 42

Bat chirps were recorded, some multiple times for each bat.
The hierarchy is species, bat, replicate. I believe this analysis is a poster child for why to think functionally and hierarchically.

  • Slide 43

A Representative Bat Chirp

  • Slide 44

Bat Chirp Spectrogram

  • Slide 45

Bat Chirp Data
The chirp is mainly composed of frequencies that start at about 40 kilohertz (kHz) and slowly decrease to 20 kHz from 0 to 8 milliseconds into the chirp. The bat then transitions to predominant frequencies at 60 kHz that slowly decrease back down to 40 kHz and then rise up to 60 kHz towards the end of the chirp. Frequencies above 80 kHz are harmonics of the fundamental signal.

  • Slide 46

One Chirp per Bat

  • Slide 47

Bat Chirp Data
It seems clear to me that this is an inherently functional problem. Trying to reduce it to a single number to do a t-test seems difficult to contemplate, but it is not impossible. People have tried t-tests and classification based on measures such as duration, start frequency, end frequency, etc.

  • Slide 48

Bat Chirp Data
One could simply take each pixel of the spectrogram and do t-tests, with FDR control. This would ignore the replicate data, would ignore the correlated nature of the data, would do no dimension reduction, etc. What did the biologist want to know? Kisi Bohn

  • Slide 49

Bat Chirp Data
She wanted to know if the bats from the same species (City Bats and Aggie Bats) evolved and have different vocalizations. What did we want to do? Answer her question precisely, and let her tell a story (the marginal question, imprecisely framed). Use all the data. Understand the variability.

  • Slide 50

Bat Chirp Data
We wavelet transformed the spectrograms, fit a 2-D hierarchical WFFM, transformed back, and did analysis of the results (see next).

  • Slide 51

Bat Chirp Data
Difference in mean spectrogram inferred from model. Red favors College Station, blue favors Austin. This could be done without random effects.

  • Slide 52

Bat Chirp Data
White areas are those in which the spectrograms differ by 1.5 fold or more, with a global FDR control of 15%.
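For comparison, the simpler per-pixel route mentioned earlier (a test at each spectrogram pixel followed by FDR control) could be sketched as below. The slides do not say which FDR procedure is meant; the Benjamini-Hochberg step-up rule is assumed here, with the 15% level borrowed from the slide as the default. It is not the model-based FDR obtained from the functional mixed model, and it ignores replicates and the correlation between pixels.

```python
def benjamini_hochberg(p_values, q=0.15):
    """Flag which hypotheses are rejected by the Benjamini-Hochberg rule.

    p_values: one p-value per pixel (e.g. from a two-sample t-test, assumed
              to have been computed elsewhere).
    q:        target false discovery rate.
    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k with p_(k) <= q * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below that cutoff.
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected
```

Applied to a flattened spectrogram, the returned flags can be reshaped back into a frequency-by-time grid to mark the pixels that differ between groups.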
Hard to do legitimately without random effects?

  • Slide 53

Frequency Agile Lidar Data
This is a recent project from Bani Mallick's group. Here is a comic describing the process.

  • Slide 54

LIDAR Data
Bani Mallick and his student Swarup De; Peter Hall and Aurore Delaigle

  • Slide 55

Frequency Agile Lidar Data
There is a transmitted signal. There is background. There is a received signal, which is then background corrected. For each time (100+) and wavelength (19), we see 625 observations across the physical range of observation, i.e., equally spaced functional data with noise.

  • Slide 56

Frequency Agile Lidar Data

  • Slide 57

There are two types of signals. The first is benign, ordinary dust that has been released. The second is biological aerosols.

  • Slide 58

Frequency Agile Lidar Data

  • Slide 59

Four samples at same time and wavelength. Background corrected only.

  • Slide 60

Frequency Agile Lidar Data
Four samples at same time and wavelength. Background corrected, truncated at zero and normalized.

  • Slide 61

Data
For aerosol type a = 1, 2, and sample i = 1, ..., n within type, we observe background corrected received data. Here t = time, w = wavelength and x = distance. This is hierarchical: there are samples within types.

  • Slide 62

Data
For aerosol type a = 1, 2, and sample i = 1, ..., n within type, we observe [formula not preserved]. This is functional: there are bivariate space-time curves over distance x and time t. It is longitudinal, over wavelength.

  • Slide 63

Approaches
For aerosol type a = 1, 2, and sample i = 1, ..., n within type, we observe [formula not preserved]. There are a vast number of approaches possible. The fun thing to do is to build a hierarchical, longitudinal, space-time model. Doing this is not trivial, will advance the field, will allow sharing of data, will allow understanding of variability, etc.

  • Slide 64

Approaches
The investigators want things far more boring. They want to know if there are differences between the two types of samples (biological and non-biological), sigh.
  • Slide 65

Approaches
Both simple questions can be handled by a model-based approach, of course. But they can also be answered by much simpler, ad hoc, dimension reduction-based and not particularly innovative approaches. We will have to decide what to do!

  • Slide 66

Conclusions
Functional, hierarchical and longitudinal data are the wave of the future. I have given 4 examples of functional data that are either hierarchical or longitudinal. Analyzing data like this is great fun!

  • Slide 67

Conclusions
The questions I have raised are about the goals of such studies. If investigators only think marginally, they miss out. If we do not think marginally, we have less influence.

  • Slide 68

Conclusions
Marginal approaches are often much faster to implement, and easier to explain. I'd like speakers at this conference to help me by indicating why powerful random effects models are better than marginal approaches.

  • Slide 69

Advertisement
TAMU has a full professor opening in computational statistics as broadly defined. Startup funding is at least $750,000.

  • Slide 70

Other Acknowledgments
I gratefully acknowledge financial support from the U.S. National Cancer Institute (R37-CA057030) and King Abdullah University of Science and Technology (KAUST, Award Number KUS-CI-016-04).

  • Slide 71

Approaches
There is a deconvolution aspect to this problem that is fairly unique. Along with the received signal, there is a transmitted signal. There is thought to be a true signal.

  • Slide 72

Approaches
The deconvolution equation is [formula not preserved]. Here, [symbol not preserved] is supposed to be white noise over x. Should one use [symbol not preserved] or [symbol not preserved]?

  • Slide 73

Approaches
It turns out that there are no systematic differences across treatment for [symbol not preserved] or for [symbol not preserved]. So differences across treatments in the received signal reflect differences in the true signal, and vice versa. Is deconvolution a good idea?
It is a heck of a lot of work, and the model assumptions are stringent.

  • Slide 74

Approaches
We think deconvolution here is not only harder than simply using the observed data, but less efficient because of the excess noise induced by deconvolution. The Mallick group has made great progress on attacking this in a systematic, functional, hierarchical, Bayesian manner.