Survival Analysis for C-MAPSS turbofan jet engines ~ predictive maintenance

Data Set FD002 of the NASA C-MAPSS multivariate time series collection covers a fleet of turbofan jet engines of the same type, each showing a different degree of initial wear and manufacturing variation, and operated under varying operational settings.

This project applies the statistical methods of Survival Analysis, which analyze the length of time until the occurrence of an event; it is therefore often referred to as time-to-event analysis. With it we can plan ahead and take steps to anticipate an unwanted event. Survival analysis was first developed in medicine to study the effect of a treatment on patients' survival until death. Beyond medicine, the method can be applied in fields such as:

  1. Manufacturing: predicting how long a machine runs until it breaks down.
  2. Marketing: predicting how long a customer stays subscribed before unsubscribing.
  3. Human Resources: predicting how long an employee stays employed until resignation.

In Survival Analysis we may observe entities of interest that started a process or came into existence before we began gathering data, or, conversely, that have not yet finished the process (or whose anticipated time to event has not yet passed) when our observation ends. Hence these censoring cases can be distinguished:

  1. Right censored: the event has not occurred by the end of the observation period, so we only know that the event time exceeds the observed time.
  2. Left censored: the event occurred before observation began, so we only know that the event time is below the observed time.
  3. Interval censored: the event occurred between two observation points, so we only know that the event time lies within that interval.

These definitions are often misunderstood, especially in the context of graphical timeline representations, hence I kept the conditions above exact and concise and omitted a visualization on purpose.

Because of censored observations, ordinary regression models are not an option; we will use survival analysis methods, which can handle censored data.

Mathematical representation

Suppose $T$ is a random variable that expresses the time until the occurrence of an event. Then

  1. Probability Density Function (PDF) $f(t)$: the probability density of the event occurring at time $t$.
  2. Cumulative Distribution Function (CDF) $F(t) = P(T \le t)$: the probability of the event occurring at or before time $t$. Mathematically, the CDF can be defined as follows: \(F(t) = \int_{0}^{t}{f(s)}\, ds\)
  3. Survival Distribution Function (SDF) $S(t) = P(T > t)$: the probability of the event occurring after time $t$. Mathematically, the SDF can be defined as follows: $S(t) = 1 - F(t)$
  4. Hazard Function (Hazard Rate Function/Force of Mortality) $h(t)$: the instantaneous rate of the event occurring at time $t$, given that the subject has not experienced the event before $t$. Mathematically, the hazard function can be obtained as follows: $h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \ln S(t)$
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Default figure size for all seaborn/matplotlib plots
sns.set(rc={'figure.figsize': (16, 9)})

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Import Data and Preprocessing

We are going to use the NASA turbofan jet engine data set, which contains rows of time series data across the operating cycles each engine has performed. In each cycle, three operational settings were recorded as well as 21 different sensor measurements.

columns = ["machine_name", "cycle", "operational_setting_1", "operational_setting_2", "operational_setting_3"] + [f'sensor_measurement_{i:02}' for i in range(1, 22)]

# Read data (whitespace-separated, no header row); use a raw string for the regex separator
turbofan_df = pd.read_csv("train_FD002.txt", header=None, sep=r"\s+", names=columns)
turbofan_df_test = pd.read_csv("test_FD002.txt", header=None, sep=r"\s+", names=columns)
turbofan_df.head()
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 ... sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21
0 1 1 34.9983 0.8400 100.0 449.44 555.32 1358.61 1137.23 5.48 ... 183.06 2387.72 8048.56 9.3461 0.02 334 2223 100.00 14.73 8.8071
1 1 2 41.9982 0.8408 100.0 445.00 549.90 1353.22 1125.78 3.91 ... 130.42 2387.66 8072.30 9.3774 0.02 330 2212 100.00 10.41 6.2665
2 1 3 24.9988 0.6218 60.0 462.54 537.31 1256.76 1047.45 7.05 ... 164.22 2028.03 7864.87 10.8941 0.02 309 1915 84.93 14.08 8.6723
3 1 4 42.0077 0.8416 100.0 445.00 549.51 1354.03 1126.38 3.91 ... 130.72 2387.61 8068.66 9.3528 0.02 329 2212 100.00 10.59 6.4701
4 1 5 25.0005 0.6203 60.0 462.54 537.07 1257.71 1047.93 7.05 ... 164.31 2028.00 7861.23 10.8963 0.02 309 1915 84.93 14.13 8.5286

5 rows × 26 columns

turbofan_df[turbofan_df.machine_name == 9]
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 ... sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21
1513 9 1 35.0044 0.8408 100.0 449.44 555.34 1353.35 1127.38 5.48 ... 182.78 2387.88 8058.44 9.3193 0.02 334 2223 100.0 15.04 8.8947
1514 9 2 10.0029 0.2500 100.0 489.05 604.82 1494.69 1311.21 10.52 ... 372.01 2388.15 8120.88 8.6418 0.03 370 2319 100.0 28.31 17.1242
1515 9 3 10.0012 0.2508 100.0 489.05 604.39 1494.88 1304.80 10.52 ... 371.81 2388.17 8120.16 8.6273 0.03 369 2319 100.0 28.59 17.0821
1516 9 4 10.0046 0.2505 100.0 489.05 604.52 1490.63 1311.28 10.52 ... 371.83 2388.14 8116.98 8.6594 0.03 369 2319 100.0 28.51 17.0908
1517 9 5 0.0014 0.0006 100.0 518.67 642.30 1586.92 1412.84 14.62 ... 522.19 2388.07 8130.64 8.4464 0.03 391 2388 100.0 38.99 23.4381
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1707 9 195 42.0064 0.8400 100.0 445.00 550.60 1374.56 1139.37 3.91 ... 130.83 2387.37 8066.38 9.4609 0.02 332 2212 100.0 10.45 6.3269
1708 9 196 10.0023 0.2500 100.0 489.05 605.97 1517.57 1321.09 10.52 ... 370.75 2388.38 8107.72 8.7508 0.03 373 2319 100.0 28.40 16.9684
1709 9 197 42.0077 0.8402 100.0 445.00 551.00 1368.98 1145.31 3.91 ... 129.82 2387.39 8063.45 9.5044 0.02 333 2212 100.0 10.37 6.3352
1710 9 198 20.0015 0.7000 100.0 491.19 608.62 1499.60 1274.34 9.35 ... 313.58 2388.16 8040.91 9.3245 0.03 371 2324 100.0 24.16 14.5259
1711 9 199 10.0047 0.2519 100.0 489.05 605.75 1515.93 1332.19 10.52 ... 369.80 2388.42 8102.98 8.7351 0.03 373 2319 100.0 28.34 16.9781

199 rows × 26 columns

Create Censored Data

Looking at the data above, it contains no right-censored observations: every engine is observed until failure. To create censored observations, we take the maximum cycle (the failure cycle) for each machine and then assume an end of the observation period.

# backup timeseries data
timeseries_df = turbofan_df.copy()
timeseries_df_test = turbofan_df_test.copy()
timeseries_df.head()
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 ... sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21
0 1 1 34.9983 0.8400 100.0 449.44 555.32 1358.61 1137.23 5.48 ... 183.06 2387.72 8048.56 9.3461 0.02 334 2223 100.00 14.73 8.8071
1 1 2 41.9982 0.8408 100.0 445.00 549.90 1353.22 1125.78 3.91 ... 130.42 2387.66 8072.30 9.3774 0.02 330 2212 100.00 10.41 6.2665
2 1 3 24.9988 0.6218 60.0 462.54 537.31 1256.76 1047.45 7.05 ... 164.22 2028.03 7864.87 10.8941 0.02 309 1915 84.93 14.08 8.6723
3 1 4 42.0077 0.8416 100.0 445.00 549.51 1354.03 1126.38 3.91 ... 130.72 2387.61 8068.66 9.3528 0.02 329 2212 100.00 10.59 6.4701
4 1 5 25.0005 0.6203 60.0 462.54 537.07 1257.71 1047.93 7.05 ... 164.31 2028.00 7861.23 10.8963 0.02 309 1915 84.93 14.13 8.5286

5 rows × 26 columns

# Select the maximum (failure) cycle per machine, broadcast back to every row
max_cycle = turbofan_df.groupby(by="machine_name")['cycle'].transform('max')
max_cycle
0        149
1        149
2        149
3        149
4        149
        ... 
53754    316
53755    316
53756    316
53757    316
53758    316
Name: cycle, Length: 53759, dtype: int64
turbofan_df = turbofan_df[turbofan_df["cycle"] == max_cycle].set_index('machine_name')
turbofan_df.head()
cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 sensor_measurement_06 ... sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21
machine_name
1 149 42.0017 0.8414 100.0 445.00 550.49 1366.01 1149.81 3.91 5.71 ... 129.55 2387.40 8066.19 9.4765 0.02 332 2212 100.0 10.45 6.2285
2 269 42.0047 0.8411 100.0 445.00 550.11 1368.75 1146.65 3.91 5.72 ... 129.76 2388.42 8110.26 9.4315 0.02 334 2212 100.0 10.56 6.2615
3 206 42.0073 0.8400 100.0 445.00 550.80 1356.97 1144.89 3.91 5.72 ... 130.02 2387.87 8082.25 9.4962 0.02 333 2212 100.0 10.46 6.3349
4 235 0.0030 0.0007 100.0 518.67 643.68 1605.86 1428.21 14.62 21.61 ... 520.25 2388.17 8215.14 8.5784 0.03 397 2388 100.0 38.47 22.9717
5 154 42.0049 0.8408 100.0 445.00 550.53 1364.82 1146.87 3.91 5.72 ... 130.05 2389.19 8151.36 9.4339 0.02 333 2212 100.0 10.74 6.3906

5 rows × 25 columns

Next, we make a lollipop plot to visualize the distribution of maximum cycles across the machines.

# Lollipop plot for each machine name
plt.hlines(y=turbofan_df.index, xmin=1, xmax=turbofan_df['cycle'], color='darkgrey')
plt.plot(turbofan_df['cycle'], turbofan_df.index, "o", color='royalblue')
plt.plot([1 for i in range(len(turbofan_df))], turbofan_df.index, "o")

ylim = plt.gca().get_ylim()
plt.ylim(ylim[1], ylim[0])

# Add titles and axis names
plt.title("Cycle runtime of engine until failure event")
plt.xlabel("Cycle")
plt.ylabel('Machine ID')

# Show the plot
plt.show()

(figure: lollipop plot of each engine's cycle runtime until the failure event)

We assume that the observation window ends after 215 cycles, so a machine still active beyond 215 cycles is considered right censored.

# Create status column: True = failure observed within the window, False = right censored
turbofan_df['status'] = turbofan_df['cycle'] <= 215
turbofan_df.head(5)
cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 sensor_measurement_06 ... sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21 status
machine_name
1 149 42.0017 0.8414 100.0 445.00 550.49 1366.01 1149.81 3.91 5.71 ... 2387.40 8066.19 9.4765 0.02 332 2212 100.0 10.45 6.2285 True
2 269 42.0047 0.8411 100.0 445.00 550.11 1368.75 1146.65 3.91 5.72 ... 2388.42 8110.26 9.4315 0.02 334 2212 100.0 10.56 6.2615 False
3 206 42.0073 0.8400 100.0 445.00 550.80 1356.97 1144.89 3.91 5.72 ... 2387.87 8082.25 9.4962 0.02 333 2212 100.0 10.46 6.3349 True
4 235 0.0030 0.0007 100.0 518.67 643.68 1605.86 1428.21 14.62 21.61 ... 2388.17 8215.14 8.5784 0.03 397 2388 100.0 38.47 22.9717 False
5 154 42.0049 0.8408 100.0 445.00 550.53 1364.82 1146.87 3.91 5.72 ... 2389.19 8151.36 9.4339 0.02 333 2212 100.0 10.74 6.3906 True

5 rows × 26 columns

A machine status of True indicates that the machine failed within the observation window, while False indicates that it had not yet failed by the end of observation.

# Distribution of each status
turbofan_df['status'].value_counts()
True     173
False     87
Name: status, dtype: int64

Exploratory Data Analysis

The next step is feature selection: choosing the variables we want to include in the model as predictors.

turbofan_df.describe()
cycle operational_setting_1 operational_setting_2 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 sensor_measurement_06 sensor_measurement_07 ... sensor_measurement_11 sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21
count 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 ... 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000 260.000000
mean 206.765385 27.937113 0.598383 466.689346 575.619115 1425.801000 1218.674692 7.124231 10.386385 255.637115 ... 43.836038 240.824538 2364.920731 8105.652538 9.288340 349.815385 2242.073077 99.014654 18.644115 11.185072
std 46.782198 16.564435 0.324420 27.479073 36.085540 97.552983 114.934975 3.949434 5.864195 153.443600 ... 2.682455 144.687933 89.267941 74.714146 0.584163 25.629762 108.889934 3.732538 10.443101 6.261529
min 128.000000 0.000100 0.000000 445.000000 536.620000 1263.290000 1058.960000 3.910000 5.710000 136.990000 ... 37.210000 129.330000 2027.610000 7850.960000 8.486100 309.000000 1915.000000 84.930000 10.270000 6.164800
25% 174.000000 10.003225 0.250650 445.000000 550.477500 1364.650000 1143.535000 3.910000 5.720000 138.277500 ... 42.650000 130.280000 2387.970000 8081.580000 8.753125 334.000000 2212.000000 100.000000 10.470000 6.283750
50% 199.000000 41.998300 0.840000 445.000000 550.880000 1370.270000 1147.880000 3.910000 5.720000 139.165000 ... 42.770000 130.985000 2388.320000 8116.755000 9.414700 335.000000 2212.000000 100.000000 10.685000 6.390250
75% 230.250000 42.002625 0.840100 489.050000 606.042500 1512.122500 1329.327500 10.520000 15.500000 392.595000 ... 46.070000 370.152500 2388.752500 8148.730000 9.456725 372.000000 2319.000000 100.000000 28.210000 16.925775
max 378.000000 42.007900 0.842000 518.670000 644.260000 1609.460000 1438.160000 14.620000 21.610000 551.980000 ... 48.510000 520.250000 2390.390000 8252.610000 11.045400 398.000000 2388.000000 100.000000 38.680000 23.153900

8 rows × 23 columns

turbofan_df.iloc[:,8:].head()
sensor_measurement_05 sensor_measurement_06 sensor_measurement_07 sensor_measurement_08 sensor_measurement_09 sensor_measurement_10 sensor_measurement_11 sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21 status
machine_name
1 3.91 5.71 137.76 2211.38 8307.72 1.02 42.77 129.55 2387.40 8066.19 9.4765 0.02 332 2212 100.0 10.45 6.2285 True
2 3.91 5.72 138.61 2212.32 8358.80 1.02 42.64 129.76 2388.42 8110.26 9.4315 0.02 334 2212 100.0 10.56 6.2615 False
3 3.91 5.72 138.08 2211.81 8329.26 1.02 42.85 130.02 2387.87 8082.25 9.4962 0.02 333 2212 100.0 10.46 6.3349 True
4 14.62 21.61 551.49 2388.17 9150.14 1.30 48.25 520.25 2388.17 8215.14 8.5784 0.03 397 2388 100.0 38.47 22.9717 False
5 3.91 5.72 138.48 2212.99 8397.52 1.02 42.67 130.05 2389.19 8151.36 9.4339 0.02 333 2212 100.0 10.74 6.3906 True

Check Uniqueness

First we check the number of unique values in each column; columns with only a few unique values will be converted to the category data type.

turbofan_df.nunique()
cycle                    133
operational_setting_1    165
operational_setting_2     62
operational_setting_3      2
sensor_measurement_01      6
sensor_measurement_02    188
sensor_measurement_03    257
sensor_measurement_04    251
sensor_measurement_05      6
sensor_measurement_06      8
sensor_measurement_07    209
sensor_measurement_08    177
sensor_measurement_09    259
sensor_measurement_10      8
sensor_measurement_11    123
sensor_measurement_12    198
sensor_measurement_13    152
sensor_measurement_14    258
sensor_measurement_15    247
sensor_measurement_16      2
sensor_measurement_17     27
sensor_measurement_18      6
sensor_measurement_19      2
sensor_measurement_20    135
sensor_measurement_21    253
status                     2
dtype: int64

Columns that can be changed to category type: operational_setting_3, sensor_measurement_16

# Change to category
cat = ['operational_setting_3', 'sensor_measurement_16']

turbofan_df[cat] = turbofan_df[cat].astype('category')
turbofan_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 260 entries, 1 to 260
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   cycle                  260 non-null    int64   
 1   operational_setting_1  260 non-null    float64 
 2   operational_setting_2  260 non-null    float64 
 3   operational_setting_3  260 non-null    category
 4   sensor_measurement_01  260 non-null    float64 
 5   sensor_measurement_02  260 non-null    float64 
 6   sensor_measurement_03  260 non-null    float64 
 7   sensor_measurement_04  260 non-null    float64 
 8   sensor_measurement_05  260 non-null    float64 
 9   sensor_measurement_06  260 non-null    float64 
 10  sensor_measurement_07  260 non-null    float64 
 11  sensor_measurement_08  260 non-null    float64 
 12  sensor_measurement_09  260 non-null    float64 
 13  sensor_measurement_10  260 non-null    float64 
 14  sensor_measurement_11  260 non-null    float64 
 15  sensor_measurement_12  260 non-null    float64 
 16  sensor_measurement_13  260 non-null    float64 
 17  sensor_measurement_14  260 non-null    float64 
 18  sensor_measurement_15  260 non-null    float64 
 19  sensor_measurement_16  260 non-null    category
 20  sensor_measurement_17  260 non-null    int64   
 21  sensor_measurement_18  260 non-null    int64   
 22  sensor_measurement_19  260 non-null    float64 
 23  sensor_measurement_20  260 non-null    float64 
 24  sensor_measurement_21  260 non-null    float64 
 25  status                 260 non-null    bool    
dtypes: bool(1), category(2), float64(20), int64(3)
memory usage: 57.9 KB

Correlation Heatmap

We can check the correlations between the candidate predictors to see whether any of them are highly correlated. Highly correlated predictors need to be pruned to avoid multicollinearity.
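As an optional, additional check (my own addition, assuming statsmodels is available), variance inflation factors quantify the same multicollinearity numerically; values far above roughly 10 indicate problematic collinearity:

# Variance inflation factor per numeric candidate predictor
from statsmodels.stats.outliers_influence import variance_inflation_factor

numeric = turbofan_df.select_dtypes('number').drop(columns='cycle')
vif = {col: variance_inflation_factor(numeric.values, i)
       for i, col in enumerate(numeric.columns)}
print(pd.Series(vif).sort_values(ascending=False))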

sns.heatmap(turbofan_df.corr()**2, annot = True,)
<AxesSubplot:>

(figure: squared-correlation heatmap of all numeric columns)

In the plot above, the correlation between variables looks quite high. As a first selection we keep the columns sensor_measurement_04, sensor_measurement_11, and sensor_measurement_14.

selected_columns = ['sensor_measurement_04','sensor_measurement_11','sensor_measurement_14']
cleaned_data = turbofan_df.loc[:, selected_columns + cat + ['status', 'cycle']]
sns.heatmap(cleaned_data.corr(), annot = True)
<AxesSubplot:>

(figure: correlation heatmap of the selected predictors)

Data Preparation for Modeling

The next step is to prepare the data for modeling. Two things need to be taken care of:

  1. Columns of category type need to be One-Hot encoded.
  2. The target needs to be a structured array in which each element is a tuple of the machine status (True/False) and the time/cycle.
# One Hot Encoding for Categorical Variable
from sksurv.preprocessing import OneHotEncoder

data_x = OneHotEncoder().fit_transform(cleaned_data.iloc[:, :-2])
data_x.head()
sensor_measurement_04 sensor_measurement_11 sensor_measurement_14 operational_setting_3=100.0 sensor_measurement_16=0.03
machine_name
1 1149.81 42.77 8066.19 1.0 0.0
2 1146.65 42.64 8110.26 1.0 0.0
3 1144.89 42.85 8082.25 1.0 0.0
4 1428.21 48.25 8215.14 1.0 1.0
5 1146.87 42.67 8151.36 1.0 0.0
# Preprocessing for target variable

data_y = list(cleaned_data.loc[:, ["status", "cycle"]].itertuples(index = None, name = None))
data_y = np.array(data_y, dtype=[('status', bool), ('cycle', float)])
data_y[:5]
array([( True, 149.), (False, 269.), ( True, 206.), (False, 235.),
       ( True, 154.)], dtype=[('status', '?'), ('cycle', '<f8')])
# The same structured array is also used later under the name target_data;
# alias it instead of rebuilding it
target_data = data_y
target_data[:5]
array([( True, 149.), (False, 269.), ( True, 206.), (False, 235.),
       ( True, 154.)], dtype=[('status', '?'), ('cycle', '<f8')])
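As an aside, scikit-survival also ships a helper that builds the same structured array in one call; this is equivalent to the manual construction above:

# Equivalent construction using scikit-survival's Surv helper
from sksurv.util import Surv

data_y_alt = Surv.from_dataframe(event='status', time='cycle', data=cleaned_data)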

Kaplan-Meier Estimator

From the data above we can build a survival function with the Kaplan-Meier estimator. It is constructed by multiplying, over the event times, the probability of surviving (not experiencing the event) at each time: \(S(t) = \prod_{j:\, t_j \le t} \frac{n_j - d_j}{n_j}\) where $n_j$ is the number of subjects at risk at time $t_j$ and $d_j$ is the number of subjects that experienced the event at $t_j$.
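Before delegating to lifelines, a minimal sketch (my own illustration) computing the estimator directly from the product formula may help:

import numpy as np

def kaplan_meier(durations, events):
    """S(t) at each distinct event time, straight from the product formula."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv, S = {}, 1.0
    for tj in np.sort(np.unique(durations[events])):
        n_j = np.sum(durations >= tj)          # subjects at risk at t_j
        d_j = np.sum(durations[events] == tj)  # events at t_j
        S *= (n_j - d_j) / n_j
        surv[tj] = S
    return surv

# Tiny worked example: events at t=2 and t=4, censoring at t=3 and t=5
print(kaplan_meier([2, 3, 4, 5], [True, False, True, False]))
# {2.0: 0.75, 4.0: 0.375}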

from lifelines import KaplanMeierFitter

# Instantiate Kaplan Meier object
kmf = KaplanMeierFitter()

# Fit Kaplan Meier estimator
kmf.fit(durations=target_data['cycle'], event_observed=target_data['status'])
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 260 total observations, 87 right-censored observations>
# Plot survival function with confidence intervals
kmf.plot_survival_function()

plt.title("Kaplan-Meier Estimator")
plt.ylabel(r"est. probability of survival $\hat{S}(t)$")
plt.xlabel("time $t$")

plt.show()

(figure: Kaplan-Meier survival function with confidence intervals)

Interpretation:

  1. Up to $t=125$ the value of $S(t)=1$, indicating that all machines are still running through the 125th cycle.
  2. At $t=175$ the value of $S(t)$ is around 0.75, indicating that about 75% of the engines are still running at the 175th cycle.
  3. The chance of an engine still not having failed drops to 50% at around cycle 190.
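These readings can also be taken programmatically from the fitted kmf object instead of off the plot (a small addition):

# Median survival time: the cycle at which S(t) first drops to 0.5
print(kmf.median_survival_time_)

# Estimated survival probabilities at selected cycles
print(kmf.survival_function_at_times([125, 175, 190]))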

Compare Survival Times for Each Value of the Categorical Columns

We have learned how to construct a survival function using the Kaplan-Meier estimator. Sometimes we also want to compare whether different categories/treatments affect the survival of our subjects.

For example, cleaned_data has two columns of category type, operational_setting_3 and sensor_measurement_16. Let's check the unique values in these two columns.

In this section we compare the survival time distributions between the categories of each categorical column, hoping to determine whether the distributions differ.

os3_unique = list(cleaned_data['operational_setting_3'].unique())
sm16_unique = list(cleaned_data['sensor_measurement_16'].unique())
print(f'Unique values operational_setting_3: {os3_unique}')
print(f'Unique values sensor_measurement_16: {sm16_unique}')
Unique values operational_setting_3: [100.0, 60.0]
Unique values sensor_measurement_16: [0.02, 0.03]

Now we compare the survival functions between the values 100 and 60 of operational_setting_3.

# plt.xlabel callable bugfix
import matplotlib.pyplot as plt
from importlib import reload
plt=reload(plt)

fig, ax = plt.subplots(1, 1, sharex='col', sharey='row')

# Mask for operational_setting_3
fullos3 = (cleaned_data['operational_setting_3'] == 100)

kmf = KaplanMeierFitter()
kmf.fit(durations=cleaned_data[fullos3]['cycle'], event_observed=cleaned_data[fullos3]['status'], label='operational_setting_3 = 100')
kmf.plot_survival_function(ax=ax)

kmf = KaplanMeierFitter()
kmf.fit(durations=cleaned_data[~fullos3]['cycle'], event_observed=cleaned_data[~fullos3]['status'], label='operational_setting_3 = 60')
kmf.plot_survival_function(ax=ax)

plt.title("Kaplan-Meier Estimator for both values of operational setting 3")
plt.legend(loc="best")
plt.xlabel('cycles')
plt.ylabel(r"est. probability of survival $\hat{S}(t)$")
plt.show()

(figure: Kaplan-Meier curves for operational_setting_3 = 100 vs. 60)

Notice that, leaving the confidence intervals (shaded areas) aside, there is a broad gap between the two survival curves. Once the confidence intervals are taken into account, however, their overlap shows that the observed gap might not reflect a real difference between the survival functions; no significant difference can be established.

A different picture emerges when two groups are separated by a threshold on sensor measurement 14: their survival curves differ significantly:

# plt.xlabel callable bugfix
import matplotlib.pyplot as plt
from importlib import reload
plt=reload(plt)

threshold = 8140

fig, ax = plt.subplots(1, 1, sharex='col', sharey='row')

# Mask
sm14 = (cleaned_data['sensor_measurement_14'] > threshold)

kmf_exp = KaplanMeierFitter()
kmf_exp.fit(durations=cleaned_data[sm14]['cycle'], event_observed=cleaned_data[sm14]['status'], label='exp')
kmf_exp.plot_survival_function(ax=ax, show_censors=True, ci_legend=True, 
                               label='sm14 greater than ' + str(threshold)
                              )

kmf_control = KaplanMeierFitter()
kmf_control.fit(durations=cleaned_data[~sm14]['cycle'], event_observed=cleaned_data[~sm14]['status'], label='control')
kmf_control.plot_survival_function(ax=ax, show_censors=True, ci_legend=True, 
                                   label='sm14 not greater than ' + str(threshold)
                                  )


from lifelines.plotting import add_at_risk_counts
add_at_risk_counts(kmf_exp, kmf_control, ax=ax)

plt.title(f"Kaplan-Meier Estimator for each Category of Sensor Measurement 14")
plt.ylabel("est. probability of survival $\hat{S}(t)$")
plt.xlabel('cycles')
plt.show()

(figure: Kaplan-Meier curves for sensor_measurement_14 above vs. below the threshold, with at-risk counts)

Log-Rank Test

In the previous section, we compared the survival functions visually. Often, however, a statistical test is required to determine whether the two distributions are the same or significantly different. For this we can use the Log-Rank Test with the usual statistical hypotheses.

$H_{0}$ : The distribution of the two survival functions is the same

$H_{1}$ : The distribution of the two survival functions is different

from sksurv.compare import compare_survival

# Note: sensor_measurement_14 is continuous; compare_survival treats every
# unique value as its own group, which explains its extremely small p-value below
cat.append('sensor_measurement_14')

p_value_list = []
for column in cat:
    p_value = compare_survival(data_y, turbofan_df[column])[1]
    p_value_list.append(p_value)
    
result = pd.DataFrame({'columns': cat, 'p-value': p_value_list}).set_index('columns')
result['conclusion'] = result['p-value'].apply(lambda x: "significant" if x < 0.05 else "not significant")
result
p-value conclusion
columns
operational_setting_3 5.391624e-03 significant
sensor_measurement_16 7.368610e-02 not significant
sensor_measurement_14 4.966909e-145 significant
cleaned_data
sensor_measurement_04 sensor_measurement_11 sensor_measurement_14 operational_setting_3 sensor_measurement_16 status cycle
machine_name
1 1149.81 42.77 8066.19 100.0 0.02 True 149
2 1146.65 42.64 8110.26 100.0 0.02 False 269
3 1144.89 42.85 8082.25 100.0 0.02 True 206
4 1428.21 48.25 8215.14 100.0 0.03 False 235
5 1146.87 42.67 8151.36 100.0 0.02 True 154
... ... ... ... ... ... ... ...
256 1425.84 48.36 8128.22 100.0 0.03 True 163
257 1146.53 42.70 8171.37 100.0 0.02 False 309
258 1326.79 46.08 8124.16 100.0 0.03 True 143
259 1149.97 42.73 8075.91 100.0 0.02 True 205
260 1145.52 42.50 8185.35 100.0 0.02 False 316

260 rows × 7 columns

cleaned_data[sm14]['cycle'].count()
83
cleaned_data[~sm14]['cycle'].count()
177
# Import logrank_test
from lifelines.statistics import logrank_test

sm16 = (cleaned_data['sensor_measurement_16'] == 0.02)

# Run log-rank test to compare groups by survival function
mask_dict = {'sensor_measurement_14': sm14, 'sensor_measurement_16': sm16, 'operational_setting_3': fullos3}
p_value_list = []

for item in mask_dict.items():
    lrt = logrank_test(durations_A = cleaned_data[item[1]]['cycle'], 
                       durations_B = cleaned_data[~item[1]]['cycle'], 
                       event_observed_A = cleaned_data[item[1]]['status'], 
                       event_observed_B = cleaned_data[~item[1]]['status'])
    p_value_list.append(lrt.p_value)

test_results = pd.DataFrame({'feature': mask_dict.keys(), 'p-value': p_value_list}).set_index('feature')
test_results['conclusion'] = test_results['p-value'].apply(lambda x: "significant" if x < 0.05 else "not significant")
test_results
p-value conclusion
feature
sensor_measurement_14 5.276097e-07 significant
sensor_measurement_16 7.368610e-02 not significant
operational_setting_3 5.391624e-03 significant

Conclusion: for the features operational_setting_3 and sensor_measurement_14 there are significant differences in the survival functions between the groups.

Weibull Distribution

Let's fit the Weibull distribution as a parametric model for our time-to-event data. We then use survival regression to make inferences about how covariates affect the survival function and to select the best survival model for our data.

# plt.xlabel callable bugfix
import matplotlib.pyplot as plt
from importlib import reload
plt=reload(plt)

# Import WeibullFitter class
from lifelines import WeibullFitter

# Instantiate WeibullFitter
wb = WeibullFitter()

# Fit data to wb
wb.fit(data_y["cycle"], data_y["status"])

# Plot survival function
wb.survival_function_.plot()
plt.title(f"time-to-failure with Weibull model")
plt.ylabel("est. probability of survival $\hat{S}(t)$")
plt.xlabel('cycles')

plt.show()

(figure: survival function of the fitted Weibull model)

threshold = 8100

# Mask on sensor_measurement_14
sm14 = (cleaned_data['sensor_measurement_14'] > threshold)

# Fit to engines with sm14 greater than the threshold
wb.fit(durations=cleaned_data[sm14]['cycle'], event_observed=cleaned_data[sm14]['status'])

# Print rho
print(f"The rho_ parameter of group sm14 greater than {threshold}: ", wb.rho_)

# Fit to engines with sm14 lower or equal than the threshold
wb.fit(durations=cleaned_data[~sm14]['cycle'], event_observed=cleaned_data[~sm14]['status'])

# Print rho
print(f"The rho_ parameter of group sm14 lower or equal than {threshold} survival function is: ", wb.rho_)
The rho_ parameter of group sm14 greater than 8100:  3.278560362707543
The rho_ parameter of group sm14 lower or equal than 8100 survival function is:  3.8024815134050174

Both rho values are well above 1, indicating that the event rate (hazard) of both groups increases over time.
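For context (a standard property of lifelines' Weibull parameterization, added here for clarity): with $S(t) = e^{-(t/\lambda)^{\rho}}$ the hazard is $h(t) = \frac{\rho}{\lambda}(t/\lambda)^{\rho - 1}$, which is increasing in $t$ exactly when $\rho > 1$. A quick numerical check on the last fitted model:

# Hazard of the fitted Weibull model at a few cycles;
# values grow with t because rho_ > 1
t = np.array([50.0, 100.0, 200.0])
print((wb.rho_ / wb.lambda_) * (t / wb.lambda_) ** (wb.rho_ - 1))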

Multiple covariates with the Weibull Accelerated Failure Time model (Weibull AFT)

# Import WeibullAFTFitter and instantiate
from lifelines import WeibullAFTFitter
aft = WeibullAFTFitter()

# Fit the cleaned engine data into aft
aft.fit(cleaned_data, duration_col='cycle', event_col='status')

# Print the summary
print(aft.summary)
                                    coef     exp(coef)   se(coef)  \
param   covariate                                                   
lambda_ operational_setting_3  -0.057131  9.444701e-01   0.014956   
        sensor_measurement_04  -0.009645  9.904010e-01   0.002959   
        sensor_measurement_11   0.475113  1.608197e+00   0.147451   
        sensor_measurement_14   0.003070  1.003074e+00   0.000585   
        sensor_measurement_16  -0.564594  5.685911e-01  13.080660   
        Intercept             -22.867489  1.171590e-10   4.716158   
rho_    Intercept               1.306051  3.691566e+00   0.058606   

                               coef lower 95%  coef upper 95%  \
param   covariate                                               
lambda_ operational_setting_3       -0.086444       -0.027818   
        sensor_measurement_04       -0.015446       -0.003845   
        sensor_measurement_11        0.186115        0.764112   
        sensor_measurement_14        0.001924        0.004215   
        sensor_measurement_16      -26.202217       25.073029   
        Intercept                  -32.110988      -13.623990   
rho_    Intercept                    1.191185        1.420916   

                               exp(coef) lower 95%  exp(coef) upper 95%  \
param   covariate                                                         
lambda_ operational_setting_3         9.171866e-01         9.725653e-01   
        sensor_measurement_04         9.846729e-01         9.961623e-01   
        sensor_measurement_11         1.204560e+00         2.147087e+00   
        sensor_measurement_14         1.001926e+00         1.004224e+00   
        sensor_measurement_16         4.173705e-12         7.746014e+10   
        Intercept                     1.133379e-14         1.211090e-06   
rho_    Intercept                     3.290980e+00         4.140913e+00   

                               cmp to          z              p    -log2(p)  
param   covariate                                                            
lambda_ operational_setting_3     0.0  -3.819968   1.334689e-04   12.871209  
        sensor_measurement_04     0.0  -3.259247   1.117084e-03    9.806046  
        sensor_measurement_11     0.0   3.222178   1.272201e-03    9.618458  
        sensor_measurement_14     0.0   5.250649   1.515641e-07   22.653569  
        sensor_measurement_16     0.0  -0.043162   9.655720e-01    0.050544  
        Intercept                 0.0  -4.848754   1.242393e-06   19.618447  
rho_    Intercept                 0.0  22.285300  5.131147e-110  363.052809  
# Calculate the exponential of sensor measurement 11 coefficient
exp_sm11 = np.exp(aft.params_.loc['lambda_'].loc['sensor_measurement_11'])
print('When sensor_measurement_11 increases by 1, the average survival duration changes by a factor of', exp_sm11)
When sensor_measurement_11 increases by 1, the average survival duration changes by a factor of 1.608196633553678
# Fit data to aft
aft.fit(cleaned_data, duration_col='cycle', event_col='status')

# Plot partial effects of sensor_measurement_11 (a sensor_measurement_14 variant left commented out)
#aft.plot_partial_effects_on_outcome(covariates='sensor_measurement_14', values=np.arange(8090, 8170, 15))
aft.plot_partial_effects_on_outcome(  covariates='sensor_measurement_11',   values=np.arange(42, 45, 0.5))

plt.axvline(x=250,ls="--",c=".5")

plt.title(f"Weibull baseline model vs. varied values of sensor measurements 11")
plt.ylabel("survival probability")
plt.xlabel('cycles')
plt.show()

(figure: Weibull AFT baseline vs. varied values of sensor_measurement_11)

The 'partial effects plot' above shows how strongly sensor measurement 11 determines survival time, all other covariates held equal: even the half-point difference between 42.5 and 43.0 roughly halves the survival probability at cycle 250.

# Predict median of new data
aft_pred = aft.predict_median(cleaned_data)

# Print the average of the predicted median failure times
print("On average, the median number of cycles for an engine to fail is: ", np.mean(aft_pred))
On average, the median number of cycles for an engine to fail is:  226.12247644680892

If we want to use the predictions of our models we first have to validate (or invalidate) them, which can be done numerically and graphically. A common numerical way to compare candidate models is the Akaike Information Criterion (AIC), with which the best-fitting model can be determined easily.
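For reference (a standard definition, added for clarity): \(\mathrm{AIC} = 2k - 2\ln(\hat{L})\), where $k$ is the number of estimated parameters and $\hat{L}$ the maximized likelihood, so the criterion trades goodness of fit against model complexity; lower values are better.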

from lifelines import ExponentialFitter, LogNormalFitter, LogLogisticFitter

# Instantiate each fitter
wb = WeibullFitter()
exp = ExponentialFitter()
lognorm = LogNormalFitter()
loglog = LogLogisticFitter()

# Fit to data
for model in [wb, exp, lognorm, loglog]:
    model.fit(durations=cleaned_data[sm14]['cycle'], event_observed=cleaned_data[sm14]['status'])
    # Print AIC (lower is better)
    print(model.__class__.__name__, model.AIC_)
WeibullFitter 1311.250877441705
ExponentialFitter 1440.767027487693
LogNormalFitter 1269.5021223158335
LogLogisticFitter 1274.537494437834

How good is the fit visually?

from lifelines.plotting import qq_plot

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes = axes.reshape(4,)

for i, model in enumerate([WeibullFitter(), LogNormalFitter(), LogLogisticFitter(), ExponentialFitter()]):
    model.fit(durations=cleaned_data[sm14]['cycle'], event_observed=cleaned_data[sm14]['status'])
    qq_plot(model, ax=axes[i])

plt.suptitle(f"QQ Plot of Fitted models of different distributions of those engines with Sensor Measurement 14 above threshold {threshold}" )
plt.show()

(figure: QQ plots of the four fitted distributions)

The fitted LogNormal model fits the data best and reasonably well; the minor separation from the diagonal in the tails is expected and within acceptable bounds.

# Drop the non-significant one-hot column before fitting the Cox model
data_x = data_x.drop(columns=["sensor_measurement_16=0.03"])

Cox Proportional Hazards

The hazard function $ h(t) $ describes the probability that the event of interest will occur at a given time if individuals survive to that time. The hazard rates along this function are the instantaneous rates for the occurrence of the event. \(h(t) = h_{0}(t) e^{\beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{n}x_{n}}\)

The Cox Proportional Hazards (CoxPH) model is a semiparametric model that predicts the hazard rate and consists of two components: a nonparametric baseline hazard $h_{0}(t)$ and a parametric part, the exponential of a linear combination of the covariates.

The survival function can be obtained from the hazard function, to which it is closely related: \(S(t) = e^{-\int_{0}^{t} h(s)\, ds}\)

from sksurv.linear_model import CoxPHSurvivalAnalysis

# Define model; a small regularization (e.g. alpha=1e-4) could be passed
# if the optimization runs into convergence problems
estimator = CoxPHSurvivalAnalysis(verbose=10)

# Fitting model
estimator.fit(data_x, data_y)
iter      1: update = [-0.02672891  1.30951039  0.00897907 -6.14944208]
iter      1: loss = 3.4407602869
iter      2: update = [ 0.00043914 -0.0348166   0.00100936 -0.27206197]
iter      2: loss = 3.4389450619
iter      3: update = [-1.01083581e-05  6.09861343e-04 -3.68327372e-06 -1.80823588e-02]
iter      3: loss = 3.4389378698
iter      4: update = [ 1.34113910e-08 -7.92317762e-07  2.80717143e-09 -1.08672544e-04]
iter      4: loss = 3.4389378694
iter      4: optimization converged
CoxPHSurvivalAnalysis(verbose=10)
# Drop sensor_measurement_16 (not significant per the log-rank test) before the lifelines fit
cleaned_data = cleaned_data.drop('sensor_measurement_16', axis=1)
# Import CoxPHFitter class
from lifelines import CoxPHFitter

# Instantiate CoxPHFitter class cph
cph = CoxPHFitter(
    #penalizer=0.001
)

# Fit cph to data
cph.fit(df=cleaned_data, duration_col="cycle", event_col="status")

# Print model summary
print(cph.summary)
                           coef  exp(coef)  se(coef)  coef lower 95%  \
covariate                                                              
sensor_measurement_04  0.026635   1.026993  0.009872        0.007286   
sensor_measurement_11 -1.291693   0.274805  0.515289       -2.301640   
sensor_measurement_14 -0.010052   0.989998  0.002043       -0.014057   
operational_setting_3  0.162776   1.176773  0.053030        0.058840   

                       coef upper 95%  exp(coef) lower 95%  \
covariate                                                    
sensor_measurement_04        0.045984             1.007312   
sensor_measurement_11       -0.281746             0.100095   
sensor_measurement_14       -0.006047             0.986041   
operational_setting_3        0.266712             1.060605   

                       exp(coef) upper 95%  cmp to         z             p  \
covariate                                                                    
sensor_measurement_04             1.047058     0.0  2.697954  6.976709e-03   
sensor_measurement_11             0.754465     0.0 -2.506737  1.218513e-02   
sensor_measurement_14             0.993971     0.0 -4.919302  8.685343e-07   
operational_setting_3             1.305664     0.0  3.069527  2.143978e-03   

                        -log2(p)  
covariate                         
sensor_measurement_04   7.163238  
sensor_measurement_11   6.358734  
sensor_measurement_14  20.134914  
operational_setting_3   8.865494  

The hazard ratio is exp(coef), i.e. e to the power of a coefficient. It indicates the factor by which the hazard changes with a one-unit change in the covariate.

So with a one-unit increase in operational_setting_3 the hazard changes by a factor of 1.177, a 17.7% increase over the baseline hazard.

All four covariates are statistically significant per the p column, indicating a strong association between each covariate and the hazard.
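lifelines also exposes these ratios directly on the fitted model, which avoids computing exp(coef) by hand:

# Hazard ratio exp(coef) for each covariate
print(cph.hazard_ratios_)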

# Coef model
pd.Series(estimator.coef_, index = data_x.columns)
sensor_measurement_04          0.026300
sensor_measurement_11         -1.275303
sensor_measurement_14         -0.009985
operational_setting_3=100.0    6.439695
dtype: float64

Interpretation:

# Assign summary to summary_df
summary_df = cph.summary

# Create new column of survival time ratios
summary_df["surv_ratio"] = 1 / summary_df['exp(coef)']

summary_df
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% cmp to z p -log2(p) surv_ratio
covariate
sensor_measurement_04 0.026635 1.026993 0.009872 0.007286 0.045984 1.007312 1.047058 0.0 2.697954 6.976709e-03 7.163238 0.973717
sensor_measurement_11 -1.291693 0.274805 0.515289 -2.301640 -0.281746 0.100095 0.754465 0.0 -2.506737 1.218513e-02 6.358734 3.638942
sensor_measurement_14 -0.010052 0.989998 0.002043 -0.014057 -0.006047 0.986041 0.993971 0.0 -4.919302 8.685343e-07 20.134914 1.010103
operational_setting_3 0.162776 1.176773 0.053030 0.058840 0.266712 1.060605 1.305664 0.0 3.069527 2.143978e-03 8.865494 0.849782
# Print surv_ratio for sensor_measurement_04
survivalratio = summary_df.loc['sensor_measurement_04', "surv_ratio"]
survivalratio
0.973716525421316
TimeToFailureChange = survivalratio - 1
TimeToFailureChange
-0.026283474578683963
print(f'Time-to-failure changes by {TimeToFailureChange * 100:.2f} %')
Time-to-failure changes by -2.63 %

So a one-unit increase in sensor_measurement_04 shortens the expected time to failure by about 2.6 percent.

covars = list(summary_df["surv_ratio"].index)
covars
['sensor_measurement_04',
 'sensor_measurement_11',
 'sensor_measurement_14',
 'operational_setting_3']
sorted(covars)
['operational_setting_3',
 'sensor_measurement_04',
 'sensor_measurement_11',
 'sensor_measurement_14']
# Plot partial effects
cph.plot_partial_effects_on_outcome(covariates=sorted(covars),
         values=[[100.0,1140,42,8080], 
                 [100.0,1130,43,8100], 
                 [100.0,1150,45,8120],
                 [60,   1150,40,8080]])

# Show plot
plt.title("Indiviual survival curves of four prototype Turbofan Jet engines with specific covariate properties - Time to Failure Estimation" )
plt.ylabel("est. probability of survival")
#plt.xlabel("cycles")
plt.legend(loc="upper right")
plt.show()

(figure: survival curves of four prototype engines - time-to-failure estimation)

# Check PH assumption
print(cph.check_assumptions(training_df=cleaned_data, p_value_threshold=0.05, show_plots=True))
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
null_distribution chi squared
degrees_of_freedom 1
model <lifelines.CoxPHFitter: fitted with 260 total ...
test_name proportional_hazard_test
test_statistic p -log2(p)
operational_setting_3 km 0.87 0.35 1.51
rank 0.95 0.33 1.60
sensor_measurement_04 km 1.58 0.21 2.26
rank 1.66 0.20 2.34
sensor_measurement_11 km 2.10 0.15 2.77
rank 2.20 0.14 2.86
sensor_measurement_14 km 8.35 <0.005 8.02
rank 8.12 <0.005 7.84
1. Variable 'sensor_measurement_14' failed the non-proportional test: p-value is 0.0039.

   Advice 1: the functional form of the variable 'sensor_measurement_14' might be incorrect. That
is, there may be non-linear terms missing. The proportional hazard test used is very sensitive to
incorrect functional forms. See documentation in link [D] below on how to specify a functional form.

   Advice 2: try binning the variable 'sensor_measurement_14' using pd.cut, and then specify it in
`strata=['sensor_measurement_14', ...]` in the call in `.fit`. See documentation in link [B] below.

   Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.


   Bootstrapping lowess lines. May take a moment...


---
[A]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
[C]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
[D]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
[E]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification

[[<AxesSubplot:xlabel='rank-transformed time\n(p=0.0044)'>, <AxesSubplot:xlabel='km-transformed time\n(p=0.0039)'>]]

(figure: proportional-hazards diagnostic plots for sensor_measurement_14)

Fix the functional form of sm14 (instead of stratifying) and redo the CoxPH fitting

cleaned_strata_data = cleaned_data.copy()
cleaned_data.sensor_measurement_14.describe()
count     260.000000
mean     8105.652538
std        74.714146
min      7850.960000
25%      8081.580000
50%      8116.755000
75%      8148.730000
max      8252.610000
Name: sensor_measurement_14, dtype: float64
cleaned_strata_data
sensor_measurement_04 sensor_measurement_11 sensor_measurement_14 operational_setting_3 status cycle
machine_name
1 1149.81 42.77 8066.19 100.0 True 149
2 1146.65 42.64 8110.26 100.0 False 269
3 1144.89 42.85 8082.25 100.0 True 206
4 1428.21 48.25 8215.14 100.0 False 235
5 1146.87 42.67 8151.36 100.0 True 154
... ... ... ... ... ... ...
256 1425.84 48.36 8128.22 100.0 True 163
257 1146.53 42.70 8171.37 100.0 False 309
258 1326.79 46.08 8124.16 100.0 True 143
259 1149.97 42.73 8075.91 100.0 True 205
260 1145.52 42.50 8185.35 100.0 False 316

260 rows × 6 columns
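Before refitting, note that check_assumptions offered two remedies for sensor_measurement_14: stratification (Advice 2) or a better functional form (Advice 1). A stratification sketch (my own, using hypothetical quartile bins via pd.qcut; not what is fitted below) would look like this:

# Hypothetical alternative: bin sm14 into quartiles and stratify on the bins
strata_df = cleaned_strata_data.copy()
strata_df['sm14_strata'] = pd.qcut(strata_df['sensor_measurement_14'], q=4)

cph_strata = CoxPHFitter()
cph_strata.fit(strata_df.drop(columns='sensor_measurement_14'),
               duration_col='cycle', event_col='status', strata=['sm14_strata'])

Below we instead keep sensor_measurement_14 continuous and model its functional form with a B-spline basis: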

cph2 = CoxPHFitter()
cph2.fit(df=cleaned_strata_data, duration_col="cycle", event_col="status", 
         #strata=['sm14_strata'], 
         formula="bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300) + sensor_measurement_04 + sensor_measurement_11 + operational_setting_3"
        )
<lifelines.CoxPHFitter: fitted with 260 total observations, 81 right-censored observations>
cph2.print_summary(3)
sns.set(rc = {'figure.figsize':(7,7)})
cph2.plot()
plt.show()
model lifelines.CoxPHFitter
duration col 'cycle'
event col 'status'
baseline estimation breslow
number of observations 260
number of events observed 179
partial log-likelihood -891.762
time fit was run 2022-09-17 08:02:35 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% cmp to z p -log2(p)
bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[1] 0.931 2.537 2.994 -4.937 6.799 0.007 896.764 0.000 0.311 0.756 0.404
bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[2] -4.767 0.009 2.662 -9.984 0.451 0.000 1.570 0.000 -1.791 0.073 3.769
bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[3] -2.244 0.106 2.345 -6.840 2.351 0.001 10.500 0.000 -0.957 0.338 1.563
operational_setting_3[T.100.0] 6.298 543.528 2.649 1.107 11.489 3.025 97671.295 0.000 2.378 0.017 5.844
sensor_measurement_04 0.027 1.027 0.010 0.008 0.046 1.008 1.047 0.000 2.725 0.006 7.281
sensor_measurement_11 -1.298 0.273 0.514 -2.306 -0.290 0.100 0.749 0.000 -2.523 0.012 6.425


Concordance 0.650
Partial AIC 1795.525
log-likelihood ratio test 39.294 on 6 df
-log2(p) of ll-ratio test 20.605

(figure: coefficient plot of the spline-based CoxPH model)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

cph2.check_assumptions(cleaned_strata_data)
The ``p_value_threshold`` is set at 0.01. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
null_distribution chi squared
degrees_of_freedom 1
model <lifelines.CoxPHFitter: fitted with 260 total ...
test_name proportional_hazard_test
test_statistic p -log2(p)
bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[1] km 3.38 0.07 3.92
rank 3.38 0.07 3.92
bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[2] km 0.31 0.58 0.79
rank 0.29 0.59 0.76
bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[3] km 6.63 0.01 6.64
rank 6.59 0.01 6.61
operational_setting_3[T.100.0] km 0.06 0.80 0.32
rank 0.08 0.77 0.37
sensor_measurement_04 km 1.70 0.19 2.38
rank 1.78 0.18 2.46
sensor_measurement_11 km 2.21 0.14 2.87
rank 2.31 0.13 2.96
1. Variable 'bs(sensor_measurement_14, df=3, lower_bound=7850, upper_bound=8300)[3]' failed the non-proportional test: p-value is 0.0100.

   Advice 1: the functional form of the variable 'bs(sensor_measurement_14, df=3, lower_bound=7850,
upper_bound=8300)[3]' might be incorrect. That is, there may be non-linear terms missing. The
proportional hazard test used is very sensitive to incorrect functional forms. See documentation in
link [D] below on how to specify a functional form.

   Advice 2: try binning the variable 'bs(sensor_measurement_14, df=3, lower_bound=7850,
upper_bound=8300)[3]' using pd.cut, and then specify it in `strata=['bs(sensor_measurement_14, df=3,
lower_bound=7850, upper_bound=8300)[3]', ...]` in the call in `.fit`. See documentation in link [B]
below.

   Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.


---
[A]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
[C]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
[D]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
[E]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification

[]

Time to Failure prediction for a new set of turbofan jet engines

# Select engines that have not failed yet and assume each has already
# been running for 200 cycles
engines_still_working = cleaned_strata_data.loc[cleaned_strata_data['status'] == False].copy()
engines_still_working.loc[:, 'age_cycles'] = 200
engines_still_working
sensor_measurement_04 sensor_measurement_11 sensor_measurement_14 operational_setting_3 status cycle age_cycles
machine_name
2 1146.65 42.64 8110.26 100.0 False 269 200
4 1428.21 48.25 8215.14 100.0 False 235 200
11 1144.57 42.83 8167.76 100.0 False 271 200
12 1326.77 46.25 8132.04 100.0 False 249 200
13 1146.18 42.67 8055.61 100.0 False 227 200
... ... ... ... ... ... ... ...
251 1145.99 42.61 8144.66 100.0 False 266 200
254 1075.12 37.46 7855.14 60.0 False 260 200
255 1143.87 42.79 8091.37 100.0 False 340 200
257 1146.53 42.70 8171.37 100.0 False 309 200
260 1145.52 42.50 8185.35 100.0 False 316 200

81 rows × 7 columns

# Existing durations of the engines that have not failed yet
engines_still_working_last_obs = engines_still_working['age_cycles']
# Predict survival function conditional on existing durations
cph2.predict_survival_function(engines_still_working, 
                              conditional_after=engines_still_working_last_obs)
2 4 11 12 13 31 32 41 42 47 ... 240 242 243 245 248 251 254 255 257 260
128.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
129.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
133.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
135.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
136.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
343.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
344.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
347.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
365.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147
378.0 0.638475 0.749801 0.82612 0.662427 0.422205 0.704433 0.761802 0.532168 0.660271 0.815766 ... 0.600644 0.770246 0.842319 0.756958 0.688897 0.727469 0.406703 0.649048 0.79238 0.762147

133 rows × 81 columns

# Predict median remaining times for the engines still running,
# conditional on having survived their existing durations
pred = cph2.predict_median(engines_still_working, conditional_after=engines_still_working_last_obs)

# Print the predicted median remaining times
print(pred)
2        inf
4        inf
11       inf
12       inf
13     128.0
       ...  
251      inf
254    128.0
255      inf
257      inf
260      inf
Name: 0.5, Length: 81, dtype: float64
# Count engines with a finite predicted median remaining lifetime
numbers = (pred != np.inf)
numbers.sum()
10

So far we have ignored the time series component of this data set and treated the problem as a classical survival problem, in which covariates usually do not vary (or are not recorded) over time. Now we take the time-varying measurements into account.

Predict Remaining Useful Lifetime (RUL) with a CoxTimeVarying Model

The CoxTimeVarying model can take advantage of the time series nature of the data, as it is able to consider multiple observations for each engine.

The downside of this model is that its results are less intuitive to interpret. In general, higher partial hazards indicate a greater risk of breakdown, but this is not a direct indicator of remaining useful lifetime (time-to-failure).
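The data preparation below builds the long format this model expects: one row per observed cycle per engine, with `start`/`cycle` delimiting the interval and `status` flagging the failure row. The eventual fitting step would look roughly like this (a sketch using lifelines' CoxTimeVaryingFitter on the RUL_timevary_df frame constructed below; the column subset and penalizer value are my assumptions):

from lifelines import CoxTimeVaryingFitter

# Sketch: fit on the long-format data prepared below
ctv = CoxTimeVaryingFitter(penalizer=0.1)
ctv.fit(RUL_timevary_df[['machine_name', 'start', 'cycle', 'status',
                         'sensor_measurement_04', 'sensor_measurement_11',
                         'sensor_measurement_14']],
        id_col='machine_name', start_col='start', stop_col='cycle',
        event_col='status')
ctv.print_summary()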

timevary_df = timeseries_df.copy()
timevary_df_test = timeseries_df_test.copy()
timevary_df.head()
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 ... sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21
0 1 1 34.9983 0.8400 100.0 449.44 555.32 1358.61 1137.23 5.48 ... 183.06 2387.72 8048.56 9.3461 0.02 334 2223 100.00 14.73 8.8071
1 1 2 41.9982 0.8408 100.0 445.00 549.90 1353.22 1125.78 3.91 ... 130.42 2387.66 8072.30 9.3774 0.02 330 2212 100.00 10.41 6.2665
2 1 3 24.9988 0.6218 60.0 462.54 537.31 1256.76 1047.45 7.05 ... 164.22 2028.03 7864.87 10.8941 0.02 309 1915 84.93 14.08 8.6723
3 1 4 42.0077 0.8416 100.0 445.00 549.51 1354.03 1126.38 3.91 ... 130.72 2387.61 8068.66 9.3528 0.02 329 2212 100.00 10.59 6.4701
4 1 5 25.0005 0.6203 60.0 462.54 537.07 1257.71 1047.93 7.05 ... 164.31 2028.00 7861.23 10.8963 0.02 309 1915 84.93 14.13 8.5286

5 rows × 26 columns

timevary_df.groupby(by="machine_name")['cycle'].max()
machine_name
1      149
2      269
3      206
4      235
5      154
      ... 
256    163
257    309
258    143
259    205
260    316
Name: cycle, Length: 260, dtype: int64
# RUL
RUL_timevary_df = timevary_df.merge(timevary_df.groupby(by="machine_name")['cycle'].max().to_frame(name='fail_cycle'), left_on='machine_name', right_index=True)
RUL_timevary_df_test = timevary_df_test.merge(timevary_df_test.groupby(by="machine_name")['cycle'].max().to_frame(name='fail_cycle'), left_on='machine_name', right_index=True)
RUL_timevary_df 
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 ... sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21 fail_cycle
0 1 1 34.9983 0.8400 100.0 449.44 555.32 1358.61 1137.23 5.48 ... 2387.72 8048.56 9.3461 0.02 334 2223 100.00 14.73 8.8071 149
1 1 2 41.9982 0.8408 100.0 445.00 549.90 1353.22 1125.78 3.91 ... 2387.66 8072.30 9.3774 0.02 330 2212 100.00 10.41 6.2665 149
2 1 3 24.9988 0.6218 60.0 462.54 537.31 1256.76 1047.45 7.05 ... 2028.03 7864.87 10.8941 0.02 309 1915 84.93 14.08 8.6723 149
3 1 4 42.0077 0.8416 100.0 445.00 549.51 1354.03 1126.38 3.91 ... 2387.61 8068.66 9.3528 0.02 329 2212 100.00 10.59 6.4701 149
4 1 5 25.0005 0.6203 60.0 462.54 537.07 1257.71 1047.93 7.05 ... 2028.00 7861.23 10.8963 0.02 309 1915 84.93 14.13 8.5286 149
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
53754 260 312 20.0037 0.7000 100.0 491.19 608.79 1495.60 1269.51 9.35 ... 2389.02 8169.64 9.3035 0.03 369 2324 100.00 24.36 14.5189 316
53755 260 313 10.0022 0.2510 100.0 489.05 605.81 1514.32 1324.12 10.52 ... 2388.42 8245.36 8.7586 0.03 374 2319 100.00 28.10 16.9454 316
53756 260 314 25.0041 0.6200 60.0 462.54 537.48 1276.24 1057.92 7.05 ... 2030.33 7971.25 11.0657 0.02 310 1915 84.93 14.19 8.5503 316
53757 260 315 25.0033 0.6220 60.0 462.54 537.84 1272.95 1066.30 7.05 ... 2030.35 7972.47 11.0537 0.02 311 1915 84.93 14.05 8.3729 316
53758 260 316 35.0036 0.8400 100.0 449.44 556.64 1374.61 1145.52 5.48 ... 2390.38 8185.35 9.3998 0.02 338 2223 100.00 14.75 8.8446 316

53759 rows × 27 columns

# Each row's observation interval is (start, cycle]
RUL_timevary_df['start'] = RUL_timevary_df['cycle'] - 1
RUL_timevary_df_test['start'] = RUL_timevary_df_test['cycle'] - 1

# status is True only on the last (failure) cycle of each engine
RUL_timevary_df['status'] = np.where(RUL_timevary_df['cycle'] >= RUL_timevary_df['fail_cycle'], True, False)
RUL_timevary_df_test['status'] = np.where(RUL_timevary_df_test['cycle'] >= RUL_timevary_df_test['fail_cycle'], True, False)

# Calculate the remaining useful life for each row
RUL_timevary_df["RUL"] = RUL_timevary_df["fail_cycle"] - RUL_timevary_df["cycle"]
RUL_timevary_df_test["RUL"] = RUL_timevary_df_test["fail_cycle"] - RUL_timevary_df_test["cycle"]
RUL_timevary_df_test
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_01 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_05 ... sensor_measurement_16 sensor_measurement_17 sensor_measurement_18 sensor_measurement_19 sensor_measurement_20 sensor_measurement_21 fail_cycle start status RUL
0 1 1 9.9987 0.2502 100.0 489.05 605.03 1497.17 1304.99 10.52 ... 0.03 369 2319 100.00 28.42 17.1551 258 0 False 257
1 1 2 20.0026 0.7000 100.0 491.19 607.82 1481.20 1246.11 9.35 ... 0.02 364 2324 100.00 24.29 14.8039 258 1 False 256
2 1 3 35.0045 0.8400 100.0 449.44 556.00 1359.08 1128.36 5.48 ... 0.02 333 2223 100.00 14.98 8.9125 258 2 False 255
3 1 4 42.0066 0.8410 100.0 445.00 550.17 1349.69 1127.89 3.91 ... 0.02 332 2212 100.00 10.35 6.4181 258 3 False 254
4 1 5 24.9985 0.6213 60.0 462.54 536.72 1253.18 1050.69 7.05 ... 0.02 305 1915 84.93 14.31 8.5740 258 4 False 253
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
33986 259 119 35.0015 0.8403 100.0 449.44 555.56 1366.01 1129.47 5.48 ... 0.02 334 2223 100.00 14.94 8.9065 123 118 False 4
33987 259 120 42.0066 0.8405 100.0 445.00 549.42 1351.13 1123.86 3.91 ... 0.02 332 2212 100.00 10.57 6.4075 123 119 False 3
33988 259 121 42.0061 0.8400 100.0 445.00 549.65 1349.14 1118.91 3.91 ... 0.02 331 2212 100.00 10.57 6.4805 123 120 False 2
33989 259 122 0.0024 0.0003 100.0 518.67 642.58 1589.61 1408.16 14.62 ... 0.03 393 2388 100.00 39.08 23.3589 123 121 False 1
33990 259 123 42.0033 0.8400 100.0 445.00 549.77 1342.50 1126.96 3.91 ... 0.02 331 2212 100.00 10.63 6.3480 123 122 True 0

33991 rows × 30 columns

from lifelines import CoxTimeVaryingFitter

# Feature selection for training: drop fail_cycle, RUL and several sensor channels
RUL_timevary_df_training = RUL_timevary_df[['machine_name', 'cycle',
       'operational_setting_1', 'operational_setting_2', 'operational_setting_3',
       'sensor_measurement_02', 'sensor_measurement_03', 'sensor_measurement_04',
       'sensor_measurement_07', 'sensor_measurement_08', 'sensor_measurement_09',
       'sensor_measurement_10', 'sensor_measurement_11', 'sensor_measurement_12',
       'sensor_measurement_13', 'sensor_measurement_14', 'sensor_measurement_15',
       'sensor_measurement_17', 'sensor_measurement_20', 'sensor_measurement_21',
       'start', 'status']]

ctv = CoxTimeVaryingFitter()
ctv.fit(RUL_timevary_df_training, id_col="machine_name", event_col='status', 
        start_col='start', stop_col='cycle', show_progress=True, step_size=1)
Iteration 50: norm_delta = 2.70442, step_size = 0.23075, ll = -525.47872, newton_decrement = 0.31545, seconds_since_start = 3.4Convergence failed. See any warning messages.

Note: the optimizer reports a convergence failure for this covariate set. We inspect the summary anyway; a later fit that drops operational_setting_1 converges cleanly (see the Predicting partial hazards section below).
<lifelines.CoxTimeVaryingFitter: fitted with 53759 periods, 260 subjects, 260 events>
ctv.print_summary()
plt.figure(figsize=(10,5))
ctv.plot()
plt.show()
plt.close()
model lifelines.CoxTimeVaryingFitter
event col 'status'
number of subjects 260
number of periods 53759
number of events 260
partial log-likelihood -525.48
time fit was run 2022-09-17 08:02:36 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% cmp to z p -log2(p)
operational_setting_1 -1.02 0.36 0.22 -1.45 -0.58 0.23 0.56 0.00 -4.58 <0.005 17.71
operational_setting_2 30.86 2.51e+13 11.87 7.59 54.12 1988.10 3.18e+23 0.00 2.60 0.01 6.74
operational_setting_3 -14.26 0.00 2.21 -18.59 -9.92 0.00 0.00 0.00 -6.44 <0.005 32.97
sensor_measurement_02 1.50 4.49 0.25 1.01 1.99 2.76 7.31 0.00 6.04 <0.005 29.31
sensor_measurement_03 0.09 1.10 0.02 0.06 0.13 1.06 1.14 0.00 5.30 <0.005 23.00
sensor_measurement_04 0.11 1.11 0.02 0.07 0.14 1.08 1.15 0.00 6.38 <0.005 32.36
sensor_measurement_07 0.11 1.11 0.14 -0.17 0.38 0.85 1.46 0.00 0.77 0.44 1.18
sensor_measurement_08 -1.42 0.24 0.14 -1.69 -1.15 0.18 0.32 0.00 -10.32 <0.005 80.56
sensor_measurement_09 0.03 1.03 0.01 0.00 0.06 1.00 1.06 0.00 2.10 0.04 4.80
sensor_measurement_10 -0.45 0.64 29.28 -57.83 56.94 0.00 5.34e+24 0.00 -0.02 0.99 0.02
sensor_measurement_11 7.20 1344.92 0.66 5.91 8.50 367.50 4921.91 0.00 10.88 <0.005 89.22
sensor_measurement_12 0.06 1.07 0.15 -0.23 0.36 0.79 1.44 0.00 0.42 0.67 0.57
sensor_measurement_13 2.61 13.62 0.28 2.07 3.16 7.91 23.46 0.00 9.41 <0.005 67.49
sensor_measurement_14 -0.04 0.96 0.02 -0.07 -0.01 0.93 0.99 0.00 -2.65 0.01 6.97
sensor_measurement_15 18.04 6.82e+07 3.31 11.55 24.52 1.04e+05 4.46e+10 0.00 5.45 <0.005 24.27
sensor_measurement_17 0.40 1.50 0.07 0.26 0.54 1.30 1.72 0.00 5.62 <0.005 25.66
sensor_measurement_20 -1.43 0.24 0.70 -2.80 -0.05 0.06 0.95 0.00 -2.04 0.04 4.59
sensor_measurement_21 -3.36 0.03 1.19 -5.69 -1.04 0.00 0.35 0.00 -2.84 <0.005 7.77


Partial AIC 1086.96
log-likelihood ratio test 1328.00 on 18 df
-log2(p) of ll-ratio test 898.23

*Figure: coefficient plot of the fitted CoxTimeVarying model*
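As a reading aid for the summary table: a one-unit increase in covariate $j$ multiplies the hazard by $e^{\beta_j}$. For sensor_measurement_11, for example, $\hat{\beta} = 7.20$ gives $e^{7.20} \approx 1.3 \times 10^3$, matching the exp(coef) column (1344.92): a one-unit rise in that reading multiplies the estimated breakdown risk more than a thousandfold. The enormous spread of these factors largely reflects the very different scales of the raw sensor channels.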

Predict And Evaluate

# Simulate a snapshot in time: right-censor the training data at cycle 215,
# so engines that fail later appear as still running
RUL_timevary_df_rcensored = RUL_timevary_df[RUL_timevary_df['cycle'] <= 215][['machine_name', 'cycle',
       'operational_setting_1', 'operational_setting_2', 'operational_setting_3',
       'sensor_measurement_02', 'sensor_measurement_03', 'sensor_measurement_04',
       'sensor_measurement_07', 'sensor_measurement_08', 'sensor_measurement_09',
       'sensor_measurement_10', 'sensor_measurement_11', 'sensor_measurement_12',
       'sensor_measurement_13', 'sensor_measurement_14', 'sensor_measurement_15',
       'sensor_measurement_17', 'sensor_measurement_20', 'sensor_measurement_21',
       'start', 'status', 'RUL']].copy()

RUL_timevary_df_rcensored
machine_name cycle operational_setting_1 operational_setting_2 operational_setting_3 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_07 sensor_measurement_08 ... sensor_measurement_12 sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_17 sensor_measurement_20 sensor_measurement_21 start status RUL
0 1 1 34.9983 0.8400 100.0 555.32 1358.61 1137.23 194.64 2222.65 ... 183.06 2387.72 8048.56 9.3461 334 14.73 8.8071 0 False 148
1 1 2 41.9982 0.8408 100.0 549.90 1353.22 1125.78 138.51 2211.57 ... 130.42 2387.66 8072.30 9.3774 330 10.41 6.2665 1 False 147
2 1 3 24.9988 0.6218 60.0 537.31 1256.76 1047.45 175.71 1915.11 ... 164.22 2028.03 7864.87 10.8941 309 14.08 8.6723 2 False 146
3 1 4 42.0077 0.8416 100.0 549.51 1354.03 1126.38 138.46 2211.58 ... 130.72 2387.61 8068.66 9.3528 329 10.59 6.4701 3 False 145
4 1 5 25.0005 0.6203 60.0 537.07 1257.71 1047.93 175.05 1915.10 ... 164.31 2028.00 7861.23 10.8963 309 14.13 8.5286 4 False 144
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
53653 260 211 9.9985 0.2500 100.0 604.58 1510.23 1308.38 394.46 2318.91 ... 371.71 2388.11 8145.03 8.6239 368 28.69 17.1938 210 False 105
53654 260 212 35.0062 0.8400 100.0 555.67 1366.07 1132.65 194.45 2223.23 ... 183.63 2388.42 8083.57 9.2709 335 14.82 8.8813 211 False 104
53655 260 213 42.0066 0.8404 100.0 549.48 1353.35 1123.50 138.63 2212.20 ... 130.71 2388.38 8101.10 9.3430 331 10.65 6.3974 212 False 103
53656 260 214 0.0026 0.0001 100.0 643.07 1595.53 1409.50 554.03 2388.08 ... 521.82 2388.08 8159.58 8.4068 394 38.76 23.2706 213 False 102
53657 260 215 34.9994 0.8400 100.0 555.75 1363.27 1132.58 194.05 2223.26 ... 183.14 2388.43 8088.93 9.3100 335 14.90 8.8744 214 False 101

49862 rows × 23 columns

# Last observation of each engine, keeping only those still running at the censoring point
preds_df = RUL_timevary_df_rcensored.groupby("machine_name").last()
preds_df = preds_df[preds_df.status == False].copy()

preds_df_wo_RUL = preds_df.drop(['RUL'], axis=1)

# Log partial hazard at each engine's last observed cycle
predictions = pd.DataFrame(ctv.predict_log_partial_hazard(preds_df_wo_RUL), columns=['predictions'])
predictions.index = preds_df.index

predictions['RUL'] = preds_df['RUL']
predictions.head(10)
predictions RUL
machine_name
2 4.822688 54
4 7.431523 20
11 -0.342774 56
12 3.713341 34
13 7.702615 12
31 -2.501934 128
32 6.108181 66
41 4.688602 47
42 11.195667 6
47 5.124843 31
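To quantify the visual trend in the plot below, one could compute a rank correlation between the predicted log partial hazard and the true RUL (a sketch; a higher hazard should mean a lower RUL, i.e. a clearly negative correlation):

```python
# Sketch: rank correlation between predicted hazard and true RUL
rho = predictions['predictions'].corr(predictions['RUL'], method='spearman')
print(f'spearman rho: {rho:.2f}')
```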
# plt.xlabel callable bugfix
import matplotlib.pyplot as plt
from importlib import reload
plt=reload(plt)

plt.figure(figsize=(15,5))
plt.plot(predictions['RUL'], predictions['predictions'], '.b')
xlim = plt.gca().get_xlim()
plt.xlim(xlim[1], xlim[0])
plt.xlabel('RUL')
plt.ylabel('log_partial_hazard')
plt.title('log_partial_hazard vs. Remaining Useful Lifetime of all engines still working after cycle 215')
plt.show()

*Figure: log_partial_hazard vs. RUL of all engines still working after cycle 215*

Predicting partial hazards

# Same feature set as before, but without operational_setting_1;
# with it removed, the fit below converges
keep_cols = ['machine_name', 'cycle',
       'operational_setting_2', 'operational_setting_3',
       'sensor_measurement_02', 'sensor_measurement_03', 'sensor_measurement_04',
       'sensor_measurement_07', 'sensor_measurement_08', 'sensor_measurement_09',
       'sensor_measurement_10', 'sensor_measurement_11', 'sensor_measurement_12',
       'sensor_measurement_13', 'sensor_measurement_14', 'sensor_measurement_15',
       'sensor_measurement_17', 'sensor_measurement_20', 'sensor_measurement_21',
       'start', 'status', 'RUL']

partial_hazard_df = RUL_timevary_df[keep_cols].copy()
partial_hazard_df_test = RUL_timevary_df_test[keep_cols].copy()

partial_hazard_df_wo_RUL = partial_hazard_df.drop(['RUL'], axis=1)
partial_hazard_df_test_wo_RUL = partial_hazard_df_test.drop(['RUL'], axis=1)
partial_hazard_df_wo_RUL.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53759 entries, 0 to 53758
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   machine_name           53759 non-null  int64  
 1   cycle                  53759 non-null  int64  
 2   operational_setting_2  53759 non-null  float64
 3   operational_setting_3  53759 non-null  float64
 4   sensor_measurement_02  53759 non-null  float64
 5   sensor_measurement_03  53759 non-null  float64
 6   sensor_measurement_04  53759 non-null  float64
 7   sensor_measurement_07  53759 non-null  float64
 8   sensor_measurement_08  53759 non-null  float64
 9   sensor_measurement_09  53759 non-null  float64
 10  sensor_measurement_10  53759 non-null  float64
 11  sensor_measurement_11  53759 non-null  float64
 12  sensor_measurement_12  53759 non-null  float64
 13  sensor_measurement_13  53759 non-null  float64
 14  sensor_measurement_14  53759 non-null  float64
 15  sensor_measurement_15  53759 non-null  float64
 16  sensor_measurement_17  53759 non-null  int64  
 17  sensor_measurement_20  53759 non-null  float64
 18  sensor_measurement_21  53759 non-null  float64
 19  start                  53759 non-null  int64  
 20  status                 53759 non-null  bool   
dtypes: bool(1), float64(16), int64(4)
memory usage: 8.7 MB
ctv_all = CoxTimeVaryingFitter()
ctv_all.fit(partial_hazard_df_wo_RUL, id_col="machine_name", event_col='status', 
        start_col='start', stop_col='cycle', show_progress=True, step_size=1)
Iteration 45: norm_delta = 0.00315, step_size = 1.00000, ll = -537.06675, newton_decrement = 0.00000, seconds_since_start = 2.8Convergence completed after 45 iterations.





<lifelines.CoxTimeVaryingFitter: fitted with 53759 periods, 260 subjects, 260 events>
partial_hazard_df['log_partial_hazard'] = ctv_all.predict_log_partial_hazard(partial_hazard_df_wo_RUL)
partial_hazard_df_test['log_partial_hazard'] = ctv_all.predict_log_partial_hazard(partial_hazard_df_test_wo_RUL)
partial_hazard_df[-220:-1]
machine_name cycle operational_setting_2 operational_setting_3 sensor_measurement_02 sensor_measurement_03 sensor_measurement_04 sensor_measurement_07 sensor_measurement_08 sensor_measurement_09 ... sensor_measurement_13 sensor_measurement_14 sensor_measurement_15 sensor_measurement_17 sensor_measurement_20 sensor_measurement_21 start status RUL log_partial_hazard
53539 260 97 0.6200 60.0 537.22 1255.25 1046.52 176.24 1915.32 8008.15 ... 2028.26 7874.49 10.8557 306 14.42 8.5486 96 False 219 -2.405619
53540 260 98 0.0006 100.0 642.20 1586.33 1395.17 553.74 2388.05 9058.69 ... 2388.10 8135.97 8.3956 392 38.94 23.4231 97 False 218 -7.915739
53541 260 99 0.0020 100.0 642.39 1593.33 1400.30 554.78 2388.07 9054.35 ... 2388.07 8137.09 8.4222 392 38.97 23.3577 98 False 217 -5.113721
53542 260 100 0.8400 100.0 549.40 1355.57 1124.39 138.06 2211.87 8325.03 ... 2388.03 8088.60 9.3724 330 10.72 6.3828 99 False 216 -2.618363
53543 260 101 0.6200 60.0 536.92 1262.05 1042.80 175.73 1915.34 8010.41 ... 2028.23 7873.90 10.8736 307 14.17 8.5988 100 False 215 -1.891851
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
53753 260 311 0.0010 100.0 644.15 1604.81 1427.53 552.02 2388.13 9184.23 ... 2388.14 8246.20 8.5303 396 38.46 23.1252 310 False 5 11.023303
53754 260 312 0.7000 100.0 608.79 1495.60 1269.51 334.00 2324.92 8852.27 ... 2389.02 8169.64 9.3035 369 24.36 14.5189 311 False 4 7.773097
53755 260 313 0.2510 100.0 605.81 1514.32 1324.12 392.46 2319.17 8915.86 ... 2388.42 8245.36 8.7586 374 28.10 16.9454 312 False 3 11.300359
53756 260 314 0.6200 60.0 537.48 1276.24 1057.92 175.11 1917.37 8114.33 ... 2030.33 7971.25 11.0657 310 14.19 8.5503 313 False 2 11.386472
53757 260 315 0.6220 60.0 537.84 1272.95 1066.30 174.81 1917.36 8112.63 ... 2030.35 7972.47 11.0537 311 14.05 8.3729 314 False 1 14.245753

219 rows × 23 columns

# Rows within the last 10 cycles before failure
rul10ph = partial_hazard_df[partial_hazard_df.RUL < 11][['RUL', 'log_partial_hazard']]
rul10ph
RUL log_partial_hazard
138 10 9.825637
139 9 8.224007
140 8 10.549260
141 7 9.158902
142 6 7.731988
... ... ...
53754 4 7.773097
53755 3 11.300359
53756 2 11.386472
53757 1 14.245753
53758 0 10.462597

2860 rows × 2 columns

# plt.xlabel callable bugfix
import matplotlib.pyplot as plt
from importlib import reload
plt=reload(plt)

sns.set(rc = {'figure.figsize':(16,6)})
plt.scatter(partial_hazard_df['RUL'], partial_hazard_df['log_partial_hazard'])

plt.ylabel('log_partial_hazard')
plt.xlabel('RUL')
plt.axhline(y=5,ls="--",c=".5")
xlim = plt.gca().get_xlim()
plt.xlim(xlim[1], xlim[0])
plt.title('Train data - RUL vs. log partial hazard')

plt.show()

*Figure: Train data, RUL vs. log partial hazard*

sns.set(rc = {'figure.figsize':(16,9)})
g = sns.FacetGrid(rul10ph, col="RUL", col_wrap=4)
g.map(sns.histplot, "log_partial_hazard")
# mark a reference value on each panel
g.map(plt.axvline, x=10, ls="--", color='r')
plt.show()

*Figure: histograms of log_partial_hazard for each RUL from 10 down to 0*

Simulate Engine Retirement Heuristic before RUL = 0

The heuristic below retires an engine once its log partial hazard has exceeded a given threshold more than `recurrence_limit` times (consecutively, unless another method is chosen) within the last `window` cycles before failure.

Training data

%%time
sns.set(rc = {'figure.figsize':(8,5)})

def loghaz_stopheuristic(df, threshold, recurrence_limit, method='consecutive', window=11):
    """Retire an engine once its log partial hazard exceeds `threshold`
    more than `recurrence_limit` times (consecutively, unless another
    method is given) within the last `window` cycles before failure."""
    machinelist = df.machine_name.unique()
    decision_df = pd.DataFrame(index=machinelist)
    for m in machinelist:
        # only look at the last `window` cycles of this engine
        temp = df[(df.machine_name == m) & (df.RUL < (window + 1))]
        overcount = 0
        for row in temp.itertuples():
            if row.log_partial_hazard > threshold:
                overcount += 1
            elif method == 'consecutive':
                overcount = 0  # reset on any observation below the threshold
            if overcount > recurrence_limit:
                if row.RUL != 0:
                    decision_df.loc[m, 'stop_decision'] = True
                decision_df.loc[m, 'stop_decision_when'] = row.RUL
                break
    return df, decision_df

result, decisions = loghaz_stopheuristic(partial_hazard_df, threshold=5.5, recurrence_limit=5, method='consecutive', window=16)

print(f'stopped: {decisions.stop_decision.sum()} total: {decisions.shape[0]}')
print(f'\nmedian rul: {decisions.stop_decision_when.median()}')
decisions.stop_decision_when.plot(kind='box')
plt.ylabel('RUL')
plt.title('Distribution of engines retired with remaining RUL before failure (RUL = 0)')
plt.show()
decisions.stop_decision_when.plot(kind='hist', bins=35)
#plt.xlabel('RUL')
plt.title('Histogram of engines retired with remaining RUL')
plt.show()
stopped: 260 total: 260

median rul: 11.0

*Figure: box plot of remaining RUL at retirement (training data)*

*Figure: histogram of remaining RUL at retirement (training data)*

CPU times: total: 766 ms
Wall time: 761 ms
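To make the trade-off explicit, one could summarize how many engines the heuristic retired early, how many it only flagged at the moment of failure, how many it never flagged, and how much useful life was sacrificed in total. A minimal sketch over the `decisions` frame from above:

```python
# Sketch: summarize the retirement heuristic's trade-off
retired_early = decisions['stop_decision_when'].gt(0).sum()    # stopped before failure
at_failure = decisions['stop_decision_when'].eq(0).sum()       # flagged only at RUL = 0
never_flagged = decisions['stop_decision_when'].isna().sum()   # threshold never reached
sacrificed = decisions['stop_decision_when'].sum()             # total RUL cycles given up

print(f'retired early: {retired_early}, at failure: {at_failure}, '
      f'never flagged: {never_flagged}, cycles sacrificed: {sacrificed:.0f}')
```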

Verifying with testing data

r, decisions_test = loghaz_stopheuristic(partial_hazard_df_test, threshold=5, recurrence_limit=3, method='consecutive', window=30)

print(f'stopped: {decisions_test.stop_decision.sum()} total: {decisions_test.shape[0]} median rul: {decisions_test.stop_decision_when.median()}')
decisions_test.stop_decision_when.plot(kind='box')
plt.ylabel('RUL')
plt.title('Distribution of engines retired with remaining RUL before failure (RUL = 0)')
plt.show()
decisions_test.stop_decision_when.plot(kind='hist', bins=35)
#plt.xlabel('RUL')
plt.title('Histogram of engines retired with remaining RUL')
plt.show()
stopped: 53 total: 259 median rul: 10.0

*Figure: box plot of remaining RUL at retirement (test data)*

*Figure: histogram of remaining RUL at retirement (test data)*

decisions_test
stop_decision stop_decision_when
1 True 27.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
... ... ...
255 NaN NaN
256 NaN NaN
257 NaN NaN
258 NaN NaN
259 NaN NaN

259 rows × 2 columns

# plt.xlabel callable bugfix
import matplotlib.pyplot as plt
from importlib import reload
plt=reload(plt)

sns.set(rc = {'figure.figsize':(16,11)})
plt.subplot(2,1,1)
#plt.figsize=(15,11)
plt.scatter(partial_hazard_df['RUL'], partial_hazard_df['log_partial_hazard'])

plt.ylabel('log_partial_hazard')
plt.xlabel('RUL')
plt.axhline(y=5,ls="--",c=".5")
xlim = plt.gca().get_xlim()
plt.xlim(xlim[1], xlim[0])
plt.title('Train data - RUL vs. log partial hazard')

plt.subplot(2,1,2)
plt.scatter(partial_hazard_df_test['RUL'], partial_hazard_df_test['log_partial_hazard'], color='g' )

xlim = plt.gca().get_xlim()
plt.xlim(xlim[1], xlim[0])

plt.ylabel('log_partial_hazard')
plt.xlabel('RUL')
plt.title('Test data - RUL vs. log partial hazard')
plt.show()

*Figure: RUL vs. log partial hazard, training data (top) and test data (bottom)*

The CoxTimeVaryingFitter model was of course trained solely on the training data. On those same machines it performs well: as failure approaches towards RUL = 0, virtually all log partial hazard values rise above a threshold of roughly 5 and are clearly separated from cycles far away from the end of life.

Predicting on the test data is sobering, however. The model produces elevated log partial hazard values there as well, and more often than not it cannot distinguish engines near RUL = 0 from the others. It does not generalize well enough and is evidently overfitted. We have reached the limits of the CoxTimeVarying approach as a means of survival analysis here and would recommend reapproaching this problem with a different algorithm and problem framing.
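To put a number on this generalization gap, one could compare the same rank correlation between log partial hazard and RUL on the training and test frames (a sketch, reusing the frames from above):

```python
# Sketch: rank correlation (log partial hazard vs. RUL) on train vs. test
for name, frame in [('train', partial_hazard_df), ('test', partial_hazard_df_test)]:
    rho = frame['log_partial_hazard'].corr(frame['RUL'], method='spearman')
    print(f'{name}: spearman rho = {rho:.2f}')
```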

Update: I have since reapproached this predictive maintenance problem as a time series window classification problem, implemented as a recurrent neural network (RNN) suited to time series sequence data and its long-range dependencies. Please see my projects section: Predictive Maintenance ~ failure prediction with binary classification implemented as Gated Recurrent Units Recurrent Neural Network