

REVIEW ARTICLE 

Year : 2016  Volume : 2  Issue : 2  Page : 73-79

Survival analysis: A brief note
Nidhi Dwivedi, Sandeep Sachdeva
Department of Community Medicine, North DMC Medical College and Hindu Rao Hospital, New Delhi, India
Date of Submission: 01-Sep-2016
Date of Acceptance: 25-Oct-2016
Date of Web Publication: 13-Jan-2017
Correspondence Address: Sandeep Sachdeva, Department of Community Medicine, North DMC Medical College and Hindu Rao Hospital, New Delhi - 110 007, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/24553069.198374
This manuscript briefly describes the concepts and terminology of survival analysis: the characteristics of survival data, the need for the method, data mechanisms, the functions that describe survival data, applications in the health sciences, and the different estimation procedures.
Keywords: Cox proportional hazard, life expectancy, lost to follow-up, mortality, truncation
How to cite this article: Dwivedi N, Sachdeva S. Survival analysis: A brief note. J Curr Res Sci Med 2016;2:73-9
Introduction   
In medical research, questions of interest often include: “What is the survival pattern of patients suffering from a particular disease before or after treatment?” and “What is the survival pattern of groups treated by different methods?” A researcher may also want to determine the effect of explanatory variables such as age, sex, economic status of patients, and/or other fixed or time-varying variables on survival time, for example, “Does smoking decrease lifespan?” or “Do good eating habits and exercise lengthen the incubation period of a disease?” Some other specific applications of survival analysis are:^{[1]}
 Remission duration in a clinical trial for acute leukemia
 Time to infection in kidney dialysis patients
 Time to death in a breast cancer trial
 Time to infection among burn patients
 Time to death among kidney transplant patients
 Time to death among patients with cancer.
Similar questions also arise outside the health sector, for example: Which manufacturing process increases the lifespan of a light bulb? What drives the duration of an individual's unemployment, the length of a strike, or the duration of a recession? The statistical methods designed to describe, explain, or predict the occurrence of such events are collectively called survival analysis.^{[2],[3],[4]} In this manuscript, we briefly describe the concept of survival analysis, its applications, characteristics, need, functions, and the different procedures for estimating survival time.
What Is Survival Analysis?   
Survival analysis is a method for analyzing data in the form of “times,” measured from a well-defined time origin until the occurrence of some particular event or endpoint. Such data are called lifetime, failure time, or survival data.^{[5]} In medical research, the time origin often corresponds to the recruitment of subjects into an experimental study such as a clinical trial, which, in turn, may coincide with the diagnosis of a particular condition, the commencement of a treatment regimen, or the occurrence of some adverse event. Here “times” refers to durations and has nothing to do with “frequency” per se. If the endpoint is the death of a patient, the resulting data are literally survival times. Survival data include a response variable that measures the duration of time until the occurrence of a specific event (event time, failure time, or survival time) and possibly a set of independent variables thought to be associated with the event time variable.^{[6]} These independent variables can be either discrete, such as sex or race, or continuous, such as age or temperature.
Characteristics of Survival Data   
The characteristics of survival data include:^{[7],[8],[9],[10]}
 The time origin should be precisely defined for each individual (e.g., birth or entry into the study)
 The end event (failure) must be clearly defined (e.g., death, relapse, adverse drug reaction, development of a new disease, recovery, heart attack, or AIDS)
 All individuals should be as comparable as possible at their time origin (e.g., to determine the incidence of death due to breast cancer among breast cancer patients, every patient is followed from a baseline date, such as the date of diagnosis or the date of surgery, until the date of death or termination of the study)
 Survival times can never be negative because they are response times. Time is a positive real-valued variable that may be measured in hours, days, weeks, months, or years from the origin until the event occurs, for example, time from diagnosis to full-blown disease, time to death, or length of stay in a jail/hospital/school.
Goals of Survival Analysis   
 Estimate and interpret survival and/or hazard functions from survival data, such as time until a second heart attack for a group of myocardial infarction (MI) patients
 Compare survival and/or hazard functions, such as those of treated versus placebo MI patients in a randomized controlled trial
 Assess the relationship of explanatory variables to survival time, for example, do weight, insulin resistance, or cholesterol influence the survival time of MI patients?^{[11]}
Data Mechanism   
In general, survival data are not completely observed. The nature of the incompleteness depends on whether the missing information results from a restriction imposed by the researcher or from some random cause. There are two such data mechanisms: censoring and truncation.
Censoring
In some settings, it is difficult to observe when the event occurs because the time needed to observe the event in every subject of a study population may be impractically long, which prevents full observation. This leads to the concept of “censoring.”^{[8]} Because lifetimes can be indefinitely long, censoring is used to limit the time required for data collection.^{[11]} For example, clinical trials are conducted over a finite period with staggered entry of patients, that is, patients enter the trial over time, so the length of follow-up varies between individuals; consequently, the time to event may not be ascertained for all patients within the limited study period.
In addition, some participants may be lost to follow-up (e.g., they move away or refuse to continue in the study) before the study ends. For unbiased analysis of survival curves, it is essential that censoring due to loss to follow-up be minimal and truly “noninformative,” that is, participants who drop out should do so for reasons unrelated to the study. Informative censoring occurs when participants are lost to follow-up for reasons related to the study. Several methods have been described to deal with informative censoring.^{[12]} These include imputation techniques for missing data, sensitivity analyses that mimic best- and worst-case scenarios, and use of the dropout event as a study endpoint.
Censored observations can arise in three ways:^{[13]}
 The event of interest does not occur during the study period
 The subject is lost to follow-up
 The subject dies of some cause totally unrelated to the disease in question.
Types of censoring
There are three main types of censoring: (a) right-censoring; (b) left-censoring; and (c) interval censoring.^{[14]}
Right-censoring
Not all subjects will experience the event of interest during the observation period. In such cases, we know only that the individual had not experienced the event by the end of observation. This is referred to as right-censoring. Three common situations in which an individual's survival time is right-censored are: (1) the subject is lost to follow-up during the study, (2) the subject does not experience the event before the study ends, or (3) the subject withdraws from the study. Right-censoring includes Type I, Type II, and random censoring schemes.^{[15]}
Type I censoring
In this censoring scheme, the observation period is fixed in advance. At the end of the study, any subject who has not yet failed is “censored.” Under Type I censoring, if there is no accidental loss, all censored observations equal the length of the study. Type I censoring is also known as “time censoring.” Here, the number of items that fail before the preassigned time t_{0} is a random variable. If n items are put on test and m items fail before t_{0}, then the (n - m) items that survive beyond t_{0} are included in the time-censored sample. For example, suppose that six rats (A, B, C, D, E, and F) have been exposed to carcinogens by injecting tumor cells into their footpads, and the time to develop a tumor of a given size is observed [Figure 1].^{[6]} The investigator decides to terminate the experiment 30 weeks after its start, and rats A, B, and D are observed to develop tumors after 10, 15, and 25 weeks, respectively. Rats C and E have not developed tumors by the end of the study; their tumor-free times are therefore 30-plus weeks. Rat F died accidentally without any tumor after 19 weeks of observation, so its tumor-free time is censored at 19 weeks. The survival data (tumor-free times) are 10, 15, 30+, 25, 30+, and 19+ weeks (the plus sign indicates a censored observation).^{[6]}
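The rat example can be written as (time, event) pairs, which is the form most survival software expects. The following is a minimal Python sketch; the variable and function names are illustrative choices, not part of the cited example. Rat F is coded as censored because it died accidentally without a tumor.

```python
# Observed data for the six rats in the Type I censoring example.
# Each record is (rat, observed_time_in_weeks, event), where event = 1 means
# the tumor was observed and event = 0 means the observation is censored.
STUDY_LENGTH = 30  # weeks; fixed in advance under Type I censoring

rats = [
    ("A", 10, 1),
    ("B", 15, 1),
    ("C", STUDY_LENGTH, 0),  # no tumor by the end of the study
    ("D", 25, 1),
    ("E", STUDY_LENGTH, 0),  # no tumor by the end of the study
    ("F", 19, 0),            # accidental death before any tumor developed
]


def format_time(time, event):
    """Render a survival time, appending '+' when it is censored."""
    return f"{time}" if event == 1 else f"{time}+"


labels = [format_time(time, event) for _, time, event in rats]
n = len(rats)                         # number of subjects on test
m = sum(event for _, _, event in rats)  # number of observed failures

print(labels)
print(f"n = {n}, observed failures m = {m}, censored = {n - m}")
```

Note that the censored count here includes the accidental loss of rat F, so not every censored time equals the study length.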
Type II censoring
Here, the total number of failures is fixed in advance. We may put n items on test and terminate the experiment when a preassigned number of items, say r (< n), have failed. A sample obtained through this mechanism is called a “failure-censored” sample. In the above example, the investigator might decide to terminate the study after four of the six rats have developed tumors [Figure 2].^{[6]} A drawback of Type I and Type II censoring schemes is that they do not allow the researcher to remove individuals at points other than the terminal point of the experiment.^{[2]}
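Type II censoring can be sketched as a rule applied to the failure times that would have been seen had the experiment run indefinitely. In the Python sketch below, the function name `type2_censor` and the latent failure times are hypothetical and chosen only for illustration; the sketch also assumes there are no tied failure times.

```python
def type2_censor(failure_times, r):
    """Apply Type II (failure) censoring.

    The experiment stops at the r-th smallest failure time. Items that have
    not failed by then are censored at that stopping time.
    Returns a list of (observed_time, event) pairs in the original order.
    Assumes no ties among the failure times.
    """
    if not 1 <= r <= len(failure_times):
        raise ValueError("r must be between 1 and the number of items")
    stop_time = sorted(failure_times)[r - 1]
    return [
        (min(t, stop_time), 1 if t <= stop_time else 0)
        for t in failure_times
    ]


# Hypothetical latent failure times (weeks) for six items.
latent_times = [10, 15, 42, 25, 35, 50]

# Stop as soon as four of the six items have failed.
observed = type2_censor(latent_times, r=4)
print(observed)
```

With these made-up values, the study ends at week 35, when the fourth failure occurs, and the two remaining items are censored at 35 weeks.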
Random censoring
Random censoring, also known as Type III censoring, is a generalization of Type I censoring in which the censoring time is itself random. For example, in a medical trial, patients may enter the study in a more or less random fashion, according to their time of diagnosis [Figure 3].^{[6]} If the study is terminated at some prearranged date, then the censoring time, that is, the length of time from an individual's entry into the study until its termination, is random.^{[6]}
Left-censoring
Left-censoring occurs when the actual survival time is less than that observed by the investigator. This happens when the event has already occurred by the time of the first examination, so all that is known is that the individual's survival time is less than a certain value. For example, a researcher who wants to study the duration between HIV infection and full-blown AIDS faces a practical problem: HIV carriers can enter the sample only after they have tested positive for HIV, but it is difficult to find out when they acquired the infection. The problem of not knowing the exact point in time at which an individual entered the state of interest is referred to as left-censoring. In other words, in left-censoring all that is known is that the individual experienced the event of interest before the start of the study [Figure 4].^{[16]} It should be noted that right-censoring is very common in lifetime data, whereas left-censoring is fairly rare.
Interval censoring
An observation is said to be interval-censored if we know that the event occurred within a time interval (left, right) but do not know exactly when within this interval [Figure 5].^{[16]} Interval-censored data commonly arise in studies with a nonlethal endpoint, such as the recurrence of a disease or condition. A common example occurs in studies that entail periodic follow-up, such as studies of asthma and AIDS, in which the event is known only to have occurred between two consecutive visits.
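One common way to hold all three censoring types in a single data set is to record each observation as a pair of bounds on the unknown event time. The Python sketch below illustrates this convention; the function name `classify` and the example values are assumptions made for illustration.

```python
import math


def classify(lower, upper):
    """Classify an observation from the bounds on its event time.

    lower == upper            -> event time known exactly
    upper is infinite         -> right-censored (event after `lower`)
    lower == 0, finite upper  -> left-censored (event before `upper`)
    otherwise                 -> interval-censored (event between bounds)
    """
    if lower < 0 or lower > upper:
        raise ValueError("bounds must satisfy 0 <= lower <= upper")
    if lower == upper:
        return "exact"
    if math.isinf(upper):
        return "right-censored"
    if lower == 0:
        return "left-censored"
    return "interval-censored"


# Illustrative observations, in months since the time origin.
observations = [(5, 5), (30, math.inf), (0, 12), (6, 12)]
for lower, upper in observations:
    print((lower, upper), classify(lower, upper))
```

The pair (6, 12), for instance, describes a subject who was event-free at the 6-month visit and found to have had the event by the 12-month visit.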
Truncation
Truncation is similar to censoring but conceptually different. With truncation, the incompleteness arises from a systematic selection process inherent in the study design; for example, in experimental design, truncation means omitting all data outside a particular boundary. To illustrate, suppose a researcher wants to study the survival pattern of cancer patients aged 60 years and above in a certain area. Only patients aged 60 years or older will be included in the study, and younger patients are excluded even though they are available for study. Similarly, if the researcher wants to find the causes of anemia in children below 15 years of age, only patients younger than 15 years meet the inclusion criteria. Truncation thus describes a sampling constraint under which a failure time variable is observable only if it falls in a certain region, say [Y_{L}, Y_{R}]; in the above examples, 60 and 15 are the defined boundaries. When the value of the failure time falls outside this region, the information about the variable is completely lost and the observation is excluded from the data set.^{[17]} The type of truncation depends on which limit of the interval is fixed: if the lower limit is fixed, the sample is said to be left-truncated; if the upper limit is fixed, it is right-truncated.^{[18]}
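The practical difference from censoring is that a truncated subject never appears in the data set at all, whereas a censored subject appears with partial information. A minimal Python sketch of the age-60 example follows; the patient records and the function name `left_truncate` are hypothetical.

```python
def left_truncate(records, lower_limit):
    """Keep only records at or above the lower limit.

    Records below the limit are dropped entirely, which is what
    distinguishes truncation from censoring.
    """
    return [record for record in records if record["age"] >= lower_limit]


# Hypothetical patients available in the study area.
patients = [
    {"id": 1, "age": 72},
    {"id": 2, "age": 58},
    {"id": 3, "age": 60},
    {"id": 4, "age": 45},
]

sample = left_truncate(patients, lower_limit=60)
print([p["id"] for p in sample])
print(f"{len(patients) - len(sample)} patients never enter the data set")
```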
Functions of Survival Data   
The survival data are described or characterized by three functions:^{[19]}
 Survival function or survivorship function
 Probability density function
 Hazard function.
Survival Function
The survival function summarizes information from survival data by giving survival probabilities for different values of time. A survival probability is the probability that a person survives longer than a specified time t, that is, from the time origin to some time beyond t. The survival function is denoted by S (t) and is mathematically defined as:^{[20]}
 S (t) = P (an individual survives longer than t)
 = P (T > t).
Theoretically, all survival functions have the following characteristics [Figure 6]:
 As time t increases, S (t) never increases (it is a nonincreasing function of t)
 S (0) =1, since at the beginning of the study, no one has experienced an event, and the probability of surviving past time 0 is unity
 S (∞) =0 since if the study period were limitless, presumably everyone eventually would experience the event and the probability of surviving would ultimately fall to zero.
Probability density function
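The probability density function, f (t), is defined as the limit of the probability that an individual fails in the short interval from t to t + Δt, per unit width Δt, as Δt tends to zero; in simple terms, it is the probability of failure in a small interval per unit of time.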
Hazard function
The hazard function, h (t), is defined as the probability of failure during a very small time interval, per unit of time, given that the individual has survived to the beginning of the interval. It is therefore a rate rather than a probability. The hazard function is also known as the instantaneous failure rate or age-specific failure rate.
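For a continuous survival time T, the three functions are linked by standard identities, so knowing any one of them determines the other two. The block below states these relations; the symbols f(t) for the density and h(t) for the hazard are the conventional notation and are assumed here.

```latex
\begin{align*}
f(t) &= \lim_{\Delta t \to 0}
        \frac{P(t \le T < t + \Delta t)}{\Delta t}
      = -\frac{dS(t)}{dt}, \\[4pt]
h(t) &= \lim_{\Delta t \to 0}
        \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)}, \\[4pt]
S(t) &= \exp\!\left( -\int_{0}^{t} h(u)\, du \right).
\end{align*}
```

As a simple check, a constant hazard h(t) = λ gives S(t) = exp(-λt) and f(t) = λ exp(-λt), which is the exponential distribution.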
Life expectancy
The most popular measure of the duration of survival is the expectation of life. The average number of years expected to be lived by individuals in a population is called life expectancy. The ideal method for computing it is to follow a large cohort of live births for as long as any member of the cohort is alive. This may take more than 100 years and is impractical. Hence, as a shortcut, a life table is constructed, which assumes that individuals at different ages are exposed to the current risk of mortality. Thus, the current age-specific death rates are applied to a hypothetical cohort. The average length of life so obtained is the number of years a newborn is expected to live at the current level of mortality. This is more useful because it describes the existing situation and can be computed immediately without waiting 100 years.^{[15]}
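The life table shortcut can be sketched in a few lines of Python. This is a simplified illustration rather than a full actuarial method: it assumes equal-width age intervals, deaths occurring on average at mid-interval, and an open-ended final interval. The function name and the example death rates are hypothetical.

```python
def period_life_expectancy(death_rates, width=1.0):
    """Life expectancy at birth from current age-specific death rates.

    death_rates[i] is the death rate (deaths per person-year) in the i-th
    age interval. All intervals have the same width except the last, which
    is open-ended. Deaths are assumed to occur at mid-interval on average.
    """
    if not death_rates or any(rate <= 0 for rate in death_rates):
        raise ValueError("death rates must be a non-empty list of positives")

    survivors = 1.0     # proportion of the hypothetical cohort still alive
    person_years = 0.0  # total years lived by the cohort so far

    for rate in death_rates[:-1]:
        # Convert the death rate into a probability of dying in the interval.
        q = width * rate / (1.0 + 0.5 * width * rate)
        deaths = survivors * q
        # Survivors live the whole interval; those who die live half of it.
        person_years += width * (survivors - 0.5 * deaths)
        survivors -= deaths

    # Open-ended last interval: survivors live 1 / rate more years on average.
    person_years += survivors / death_rates[-1]
    return person_years


# Hypothetical rates for single-year ages: a constant rate of 0.02 per year.
# With a constant rate the result equals 1 / rate, which is a useful check.
print(period_life_expectancy([0.02] * 80))
```

In practice the death rates vary by age, and the same routine is applied to the observed current age-specific rates of the population of interest.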
Estimation of Survival Time   
There are three approaches to estimating survival time from observed data: parametric, semiparametric, and nonparametric.^{[9],[18],[20]} [Table 1] depicts a summary of these approaches.
Parametric approach
In parametric survival analysis, all parts of the model are specified, both the hazard function and the effect of any covariates. The strength of this approach is that estimation is easier and the estimated survival curve is smoother because it draws information from the whole data set. In addition, more sophisticated analyses are possible with parametric models, such as including random effects or using Bayesian methodology to pool sources of information. The main drawback of parametric methods is that they require extra assumptions that may not always be appropriate.^{[20]}
Parametric methods assume that the survival times follow a known probability distribution. Popular choices include the exponential, Weibull, and lognormal distributions. Hence, in the parametric approach, the researcher works with a fully specified structural model; for example, a parametric approach makes it possible to investigate potential causal pathways for inequalities in cancer survival. The researcher can select the probability distribution according to the nature of the survival data.^{[20]}
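As an illustration of the parametric approach, the simplest case is the exponential distribution, whose single rate parameter has a closed-form maximum likelihood estimate under right-censoring: the number of observed events divided by the total follow-up time. The Python sketch below applies it to the six-rat example, with rat F treated as censored at 19 weeks; the function name is an illustrative choice, and the exponential assumption is made here only for demonstration.

```python
import math


def exponential_rate_mle(times, events):
    """Maximum likelihood estimate of a constant hazard rate.

    times  : observed times (event or censoring times)
    events : 1 if the event was observed, 0 if the time is censored
    Under an exponential model, the estimate is events / total time at risk.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    n_events = sum(events)
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return n_events / sum(times)


# Tumor-free times (weeks) for rats A-F: 10, 15, 30+, 25, 30+, 19+.
times = [10, 15, 30, 25, 30, 19]
events = [1, 1, 0, 1, 0, 0]

rate = exponential_rate_mle(times, events)  # 3 events / 129 rat-weeks


def survival(t):
    """Fitted exponential survival function S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


print(f"estimated rate per week: {rate:.4f}")
print(f"estimated mean tumor-free time: {1 / rate:.1f} weeks")
```

The censored rats contribute their follow-up time to the denominator but no event to the numerator, which is how the parametric likelihood uses incomplete observations.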
Semiparametric approach
Semiparametric methods are used for multivariate survival data when some of the subjects are related, as in family studies or litter-matched animal studies. They are also used when a subject may experience two or more types of events in succession, as in infectious diseases such as AIDS, where a subject may first experience HIV infection and later the onset of clinical AIDS.^{[19]} In semiparametric survival analysis, only some parts of the model for the survival time “T” are specified.^{[21]} One of the most popular semiparametric approaches is the Cox proportional hazards model.
Cox proportional hazards model
This model is used to assess the impact of different variables, with a focus on identifying the factors that are crucial for managing a disease. In medical settings, the emphasis is on finding the causes or other characteristics of a disease, for example, whether a patient who has had a heart attack also has high blood pressure or a family history of heart problems. Regression analysis is generally used for this purpose, but ordinary regression techniques cannot be used in the presence of censored data. Cox's proportional hazards model is more appropriate in such situations.
The Cox proportional hazards model is a statistical method that not only determines the cumulative probability of an event but also accounts for the impact of covariates on that probability.^{[22],[23]} Covariates whose values change with time are called time-dependent covariates; otherwise, they are time-independent. For example, a patient's performance status during the treatment period is time-dependent, whereas sex is time-independent. Other examples of time-dependent covariates are a patient's cholesterol level, which changes during the study, and findings from regular examinations of the patient. When time-dependent covariates are involved, the standard Cox proportional hazards model cannot be used as such, and the analysis is performed using a Cox nonproportional hazards model.
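In its usual form, the Cox model leaves the baseline hazard unspecified and models only the covariate effects, which is what makes it semiparametric. The block below gives the standard formulation; the symbols h_0(t) for the baseline hazard, x for the covariates, and β for the regression coefficients are conventional notation assumed here.

```latex
\begin{align*}
h(t \mid x) &= h_0(t)\,
  \exp\!\left( \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \right), \\[4pt]
\frac{h(t \mid x)}{h(t \mid x^{*})} &=
  \exp\!\left( \sum_{j=1}^{p} \beta_j \,(x_j - x_j^{*}) \right).
\end{align*}
```

The second line shows why the hazards are called proportional: the hazard ratio between two individuals with fixed covariates x and x* does not depend on t. If a covariate x_j(t) changes over time, this ratio also changes over time, which is why time-dependent covariates call for an extended form of the model.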
Nonparametric approach
Parametric and semiparametric survival analyses require assumptions about the underlying distribution of the observed data, which may not be appropriate in some situations. When there are insufficient grounds for these assumptions, nonparametric models, also known as distribution-free models, can be an appropriate alternative. The most popular nonparametric method is the Kaplan–Meier (KM) method.^{[24]}
Kaplan–Meier analysis
The KM method was derived by Kaplan and Meier in 1958 as a way to analyze censored data by directly generalizing the empirical survival function to censored observations.^{[25]} The KM survival curve gives the probability of surviving a given length of time while considering time in many small intervals.
A KM analysis allows estimation of survival over time, even when patients drop out or are followed for different lengths of time. For each interval, the KM method estimates the probability that those who are alive at the beginning of the interval survive to its end; this is a conditional probability, and the overall survival estimate is the product of these conditional probabilities. The method can be used to compare the survival rates of two or more groups of subjects, and the results are expressed as the proportion of patients still alive at given times after entry or enrollment in the study.^{[26]}
[Figure 7] is an example of a KM curve, showing survival curves for two different groups.^{[20]} The graph plots estimated survival probabilities (when data are continuous) or estimated survival percentages (when data are discrete) on the Y-axis against time since entry into the study on the X-axis, and it consists of horizontal and vertical lines. The length of each horizontal line along the X-axis represents the survival duration for that interval. Each interval is terminated by the occurrence of the event of interest, shown as a vertical drop, and censoring times are indicated by small vertical tick marks.
The KM method is a statistical treatment of survival times that not only makes proper allowance for censored observations but also uses the information from these subjects up to the time they are censored.^{[27]} Each time interval contains only one event time, and the event is taken to occur at the beginning of the interval. Hence, in a way, it is a modified form of the “life table” technique.^{[28],[29]}
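The product of conditional probabilities described above can be computed directly. The following minimal Python sketch implements the product-limit calculation and applies it to the six-rat tumor data from the Type I censoring example, with rat F treated as censored at 19 weeks; the function name `kaplan_meier` is an illustrative choice, and no confidence intervals are computed.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  : observed times (event or censoring times)
    events : 1 if the event was observed, 0 if the time is censored
    Returns a list of (event_time, estimated_survival) pairs, one per
    distinct time at which at least one event occurred.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events > 0:
            # Conditional probability of surviving past t given at risk at t.
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        # Both failed and censored subjects leave the risk set after t.
        at_risk -= n_leaving
    return curve


# Tumor-free times (weeks) for rats A-F: 10, 15, 30+, 25, 30+, 19+.
times = [10, 15, 30, 25, 30, 19]
events = [1, 1, 0, 1, 0, 0]

km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"week {t}: S(t) = {s:.4f}")
# week 10: S(t) = 0.8333
# week 15: S(t) = 0.6667
# week 25: S(t) = 0.4444
```

Rat F contributes to the risk set up to week 19 and then drops out without lowering the curve, so only three rats remain at risk when the tumor at week 25 occurs.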
Conclusion   
Survival analysis is a tool for analyzing time-to-event data, especially in clinical trials. The primary goals are to estimate, interpret, and compare survival functions and to assess the relationship of explanatory variables with survival time.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Klein JP, Moeschberger ML, Gail M, Samet JM, Tsiatis A. Statistics for Biology and Health. New York: Springer; 2003.
2.  Balakrishnan N, Rao CR. Advances in Survival Analysis. New York: Elsevier; 2004.
3.  
4.  
5.  
6.  
7.  
8.  Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer; 2003. 
9.  Nelson W. Applied Life Data Analysis. General Electric Company Corporate Research and Development. New York: John Wiley and Sons; 1982. 
10.  
11.  
12.  Shih W. Problems in dealing with missing data and informative censoring in clinical trials. Curr Control Trials Cardiovasc Med 2002;3:4. 
13.  
14.  Machin D, Campbell MJ, Walters SJ. Medical Statistics. 4^{th} ed. New York: John Wiley and Sons; 2007.
15.  Indrayan A. Medical Biostatistics. 3^{rd} ed. New York: Chapman and Hall/CRC Press; 2013.
16.  
17.  Balakrishnan N, Basu AP. The Exponential Distribution Theory, Methods and Applications. Netherland: Gordon and Breach Publishers; 1995. 
18.  
19.  Watanabe H. Applications of statistics to medical science, IV survival analysis. J Nippon Med Sch 2012;79:176-81.
20.  Sinha D, Dey DK. Semiparametric Bayesian analysis of survival data. J Am Stat Assoc 1997;92:1195-212.
21.  
22.  Huang J, Wellner JA. Interval Censored Survival Data: A Review of Recent Progress. Proceedings of First Seattle Symposium in Biostatistics. Lecture Notes in Statistics: 123. New York: Springer; 1997. 
23.  Singh R, Mukhopadhyay K. Survival analysis in clinical trials: Basics and must know areas. Perspect Clin Res 2011;2:145-8.
24.  
25.  Satagopan JM, Ben-Porat L, Berwick M, Robson M, Kutler D, Auerbach AD. A note on competing risks in survival data analysis. Br J Cancer 2004;91:1229-35.
26.  Rich JT, Neely JG, Paniello RC, Voelker CC, Nussenbaum B, Wang EW. A practical guide to understanding Kaplan-Meier curves. Otolaryngol Head Neck Surg 2010;143:331-6.
27.  Altman DG. Analysis of Survival Times in Practical Statistics for Medical Research. London: Chapman and Hall; 1992. 
28.  
29.  Goel MK, Khanna P, Kishore J. Understanding survival analysis: Kaplan-Meier estimate. Int J Ayurveda Res 2010;1:274-8.
