Survival Analysis using R: A Student's and a Data Analyst's Take


A Data Science Student's Take on Survival Analysis

By: Fahim Karim

Survival analysis is something I learned while working as a Data Scientist for my professor, Dr. Sauleh Siddiqui, one of the world’s most eminent environmentalists and mathematicians.

Instead of telling you what it is, let me show you what it can do: it can predict which drug has the best chance of approval through all phases at the FDA (Food and Drug Administration). Analysts do this by crunching historical data.

Using time and event (passing a certain FDA phase) as dependent variables, they run calculations to see which drug has the best chance of survival or death. Here, death means the occurrence of the event (passing a certain FDA phase). If the drug keeps on surviving, it is stuck at the FDA without clearing those famous FDA phases, and the pharmaceutical firm manufacturing the drug is bleeding money. That’s bad for business.
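To make the time-and-event idea concrete, here is a minimal sketch of the Kaplan-Meier estimator, a standard way to turn durations and event flags into a survival curve. It is written in plain Python for illustration, not taken from the analysis described above. The data are made up: `months` is how long each hypothetical drug sat in a phase, and `passed` is 1 if it cleared the phase and 0 if it was still waiting when the data were collected (a censored observation).

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time an event occurred."""
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Fraction of those still at risk that "survive" past time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]  # drop both events and censored subjects
    return curve

# Hypothetical data, for illustration only
months = [6, 8, 8, 12, 15, 20]
passed = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(months, passed)
for t, s in curve:
    print(f"month {t}: {s:.3f} of drugs still waiting")
```

Each point on the curve is the estimated share of drugs still "surviving" (not yet through the phase) at that month. Censored drugs count toward the risk set until they drop out, but never count as events.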

As we dig deeper into the survival model, we can see what kind of drug “dies” and which features, or independent variables, drive that outcome. Pharmaceutical firms can then try to replicate those features as much as possible before they manufacture a drug and send it to the FDA for approval. The drug will then have a better chance of making it to the market and bringing the firm some serious profit.

The beauty of the survival model is that it can be used in numerous scenarios. Orphan drugs are one such sector in the healthcare industry. Using the historical data of patients, we can figure out whether a patient or population is an appropriate candidate for certain orphan drugs. The definition Google gives for an orphan drug is a “pharmaceutical that remains commercially undeveloped owing to limited potential for profitability.” Pharmaceutical firms, however, will produce and market them if they know where they can find their customers. And without these unique drugs, patients with rare diseases are dying. This is a very compelling reason to produce such drugs.

Not all survival analysis is rooted in the healthcare industry. My last example is in supply chain, specifically the international export of soybeans by the United States. The two components that make up our dependent variable here are Sold (a simple Yes/No or 1/0) and Duration.
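As a rough sketch of how those two components could be derived from raw records, the snippet below turns a listing date and an optional sale date into a (Duration, Sold) pair. The function name, the dates, and the idea of an observation cutoff are all assumptions made for illustration. A shipment that has not sold by the cutoff gets Sold = 0, and its Duration is simply how long we have watched it so far.

```python
from datetime import date

def to_survival_record(listed_on, sold_on, cutoff):
    """Return (duration_days, sold) for one shipment, observed up to `cutoff`."""
    was_sold = sold_on is not None and sold_on <= cutoff
    end = sold_on if was_sold else cutoff  # unsold shipments are censored at the cutoff
    return (end - listed_on).days, 1 if was_sold else 0

# Hypothetical shipments: (date listed for sale, date sold or None)
cutoff = date(2023, 12, 31)
shipments = [
    (date(2023, 9, 1), date(2023, 9, 21)),
    (date(2023, 10, 15), None),
]

records = [to_survival_record(listed, sold, cutoff) for listed, sold in shipments]
print(records)
```

Keeping the unsold shipments in the data, instead of dropping them, is what separates survival analysis from simply averaging the time to sale.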

Suppose we can gather historical data on features like farmer, fertilizer type, zip code, seed type, waterway, barge, train/railway used, grain elevator used, oilseed crusher used, weather reports, seaports, cargo ships, and any other categories we think might be of value. We could then look for the best way to grow, transport, and sell soybeans, and allocate funds to the infrastructure that would best help us fine-tune the entire soybean industry.

Survival analysis is not taught in most graduate school courses. Academia might think this is an exclusive model that only the pharmaceutical industry can make the best use of. I would have appreciated it if it had been taught in my graduate classes. I would have appreciated it even more if the papers that taught me what survival analysis was had been more recent and from more eclectic backgrounds. I think survival analysis has a lot of merit and can be implemented in a plethora of industries, scenarios, and solutions.

I wish the papers about it were more fun!


A Freya Data Scientist’s Overview

By: Chris MacNeel

After reading Fahim Karim’s blog, I have to agree with him. Survival analysis is extremely important in many different areas, especially when assessing the reliability and maintainability of components, which is one of Freya’s main analytical focal points. With a better academic background in survival analysis, students who start careers in the industry will be better equipped to handle these problems and will integrate better within engineering communities that employ this methodology.

Survival Analysis in Component Reliability

Freya’s exposure to component reliability revolves around the aviation maintenance world, but the concept extends easily to any sort of complex machinery with many integrated components, such as cars, construction equipment, and manufacturing processes that rely on large industrial equipment like pumps and blowers. Everything we interact with on a daily basis (computers, TVs, cars, kitchen appliances, smartphones, etc.) has a life expectancy and a failure rate. Understanding these critical metrics can help companies set warranty expectations, introduce better products, troubleshoot manufacturing defects, and ultimately save money. In large industrial, manufacturing, and aviation processes, equipment uptime is critical. When an aircraft or a large piece of complex machinery goes down, it incurs costs: schedules can be impacted, products can turn into waste, and the component has to be troubleshot and fixed.

In the aviation industry, the goal is to properly balance cost and reliability. An airline or military aircraft manufacturer could promise extremely high uptime if billions of dollars were spent stocking copious amounts of components, but that would be extremely wasteful and inefficient. On the flip side, you could be purely reactive and never hold any safety stock, but then the reliability of your aircraft would be extremely low. This holds whether you own and operate the aircraft or manufacture its components. Manufacturers also need to understand when it makes sense to implement component upgrades and how to weigh component reliability against the investment in those upgrades.

One of the key aspects of Survival Analysis that we use is the Weibull Distribution. The Weibull Distribution is at the heart of reliability analysis and is used to model expected failure times. It gives the probability that a given component will have failed by a given time. We use it so much that one of our team members created a blog post about how we utilize it. There are of course other features that might influence these failures, such as the area of operations and how the aircraft is being flown. All of these can be modeled effectively with enough historical data, and these models will help in many areas of operations, logistics, and manufacturing.
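For readers who have not seen it, the two-parameter Weibull reliability function is R(t) = exp(-(t / scale)^shape), and the probability of failure by time t is 1 - R(t). The sketch below evaluates both in plain Python. The shape and scale values are made-up numbers for a hypothetical component, not fitted to any real data, and the function names are ours.

```python
import math

def weibull_reliability(t, shape, scale):
    """Probability that a component is still working at time t."""
    return math.exp(-((t / scale) ** shape))

def weibull_failure_probability(t, shape, scale):
    """Probability that a component has failed by time t."""
    return 1.0 - weibull_reliability(t, shape, scale)

def conditional_failure_probability(age, window, shape, scale):
    """Probability of failing within `window` hours, given survival to `age`."""
    survive_now = weibull_reliability(age, shape, scale)
    survive_later = weibull_reliability(age + window, shape, scale)
    return 1.0 - survive_later / survive_now

# Hypothetical parameters: shape > 1 means the failure rate rises with age (wear-out)
SHAPE = 1.5
SCALE = 1000.0  # flight hours

p_new = conditional_failure_probability(0.0, 100.0, SHAPE, SCALE)
p_old = conditional_failure_probability(900.0, 100.0, SHAPE, SCALE)
print(f"failure in next 100 h, new part:   {p_new:.3f}")
print(f"failure in next 100 h, 900 h part: {p_old:.3f}")
```

With a shape above 1, the part that has already flown 900 hours is more likely to fail in the next 100 hours than a new one. A shape of exactly 1 reduces to the exponential distribution, where age makes no difference.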

Some examples might be:

  • Understanding the impact and potential benefits of proactively changing out a part that hasn’t failed yet but is near its end of life, to keep a machine running longer
  • Identifying the optimal number of components to take on deployment, to reduce the probability of aircraft downtime caused by needing spares that aren’t easily accessible in remote locations
  • Understanding how much an upgrade that extends a component’s life would cost, and how much it would save in uptime of the larger asset
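The spares question in the list above can be sketched with a deliberately simple model. Assume, purely for illustration, that the number of failures during a deployment follows a Poisson distribution with a known mean. That corresponds to a constant failure rate, which is a simplification of the Weibull picture. The function then returns the smallest number of spares that covers demand with the chosen probability. The function name and all numbers are hypothetical.

```python
import math

def spares_needed(expected_failures, service_level):
    """Smallest s such that P(failures <= s) >= service_level under a Poisson model."""
    if not 0.0 < service_level < 1.0:
        raise ValueError("service_level must be strictly between 0 and 1")
    term = math.exp(-expected_failures)  # P(exactly 0 failures)
    cumulative = term
    spares = 0
    # Stop once demand is covered; term > 0 guards against floating-point stalls
    while cumulative < service_level and term > 0.0:
        spares += 1
        term *= expected_failures / spares  # P(exactly `spares` failures)
        cumulative += term
    return spares

# Hypothetical deployment: 2 failures expected on average
print(spares_needed(2.0, 0.90))
print(spares_needed(2.0, 0.95))
```

Raising the service level from 90% to 95% costs one extra spare in this example, which is the kind of cost-versus-reliability tradeoff described above.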

Many engineers we have interacted with use Survival Analysis and the Weibull Distribution whether they know it or not. With a strong background in Survival Analysis, you will not only be able to tackle common problems in reliability, you will also fit in more easily with engineering groups, because you share the terms and techniques that engineers employ, which makes it easier to work together and to get help.

Relevant stuff

This is a handbook on reliability that uses survival analysis. Reliability is an age-old area of statistics that has been in use for a very long time.