A machine learning approach to help detect the early onset of dementia

Loading in all required source code:

Setting a port which will be accessible to a Web GUI:

Data Wrangling

Loading in the longitudinal dataset

As the data is stored in a CSV, the standard kdb+ method of loading a CSV is used. The column names are transformed to conform with camelCase convention upon loading:

The column header group conflicts with its reserved namesake function in the .q namespace.

The internal .Q.id function is built for dealing with these cases - it will append a 1 to headers that conflict with reserved commands, as well as cleaning header characters that interfere with select/exec/update queries.

However, a custom renameCol function offers a further degree of control by letting the user specify the new name for the column.

The group column is renamed to state:

Extracting target for binary classification

The state feature is extracted into a new target table and dropped from the main data table:


Investigating data structure

The info() method in Python provides a concise descriptive summary of a dataframe or table, mainly detailing datatypes, the existence of nulls within attributes and the total number of rows. This can be coded in q to provide the same effect:

As shown above, the data is made up of 10 numerical columns and 4 remaining categorical columns. In machine learning these types of attributes are handled differently. Numerical features are easy to use, given most machine learning algorithms deal with numbers anyway, and generally these don't need to be transformed except during the imputation and standardisation stages. Categorical columns, however, will require modification through encoding and other additional measures.

Another convenient feature in Python is the describe() method, which returns a set of summary statistics for each column in a dataframe. Similarly, writing in q:

A few things to note:

  • The first few columns are self-explanatory and don't need any further elaboration.
  • The q25 and q75 fields are the 25th and 75th percentiles - the values below which 25% and 75% of observations fall.
  • The q25 value for age indicates that 25% of patients are younger than 71 years old.
  • The q75 value for age indicates that 75% of patients are younger than 82 years old.
  • The STD field is the standard deviation of each column, i.e. how dispersed the values are.
  • IQR is the interquartile range.
  • It's apparent that some fields have highly skewed distributions - namely mrDelay and etiv. It is recommended that before standardising or normalising, a log transformation be applied to make these distributions less skewed. This will aid in making the data more interpretable.
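The quartile fields described above can be illustrated with a minimal Python sketch. It uses only the standard library, the ages are made up rather than taken from the dataset, and the helper name quartile_summary is hypothetical. It is not the notebook's q implementation, and the interpolation rule chosen here may differ from the one used there.

```python
from statistics import quantiles

def quartile_summary(values):
    """Return the 25th percentile, 75th percentile and interquartile range."""
    # method="inclusive" treats the sample min and max as the 0th and
    # 100th percentiles and interpolates linearly between data points.
    q25, _median, q75 = quantiles(values, n=4, method="inclusive")
    return {"q25": q25, "q75": q75, "iqr": q75 - q25}

# Hypothetical ages, for illustration only
ages = [60, 65, 70, 75, 80, 85, 90, 95, 100]
summary = quartile_summary(ages)
print(summary)
```

Here a quarter of the hypothetical ages fall at or below q25, and the IQR is simply the distance between the two quartiles.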

Remove zero-variance predictor

The hand column has a single unique value. This is considered an uninformative zero-variance variable which can influence predictions. It is best practice to remove such features:


Split dataset into train and test sets

A common oversight in many machine learning tasks is to perform exploratory data analysis and data preprocessing prior to splitting the dataset into train and test splits. This whitepaper will follow the general principle **[7]**:

    Anything you learn, must be learned from the model's training data

The test set is intended to gauge evaluator performance on totally unseen data. If it affects training in any way, it can be classified as partially seen. If the full dataset is used during standardisation, the scaler has essentially snooped on data that should have been withheld from training, and in the process has implicitly learned the mean, median and standard deviation of the testing data by including it in its calculations. This is called data snooping bias.

As a result, models will perform strongly on training data but generalise poorly to test data (a classic case of overfitting).

To circumvent this possibility, the training and testing data are split, using an 80:20 ratio, before pre-processing steps are applied. The seed parameter ensures that the same indices are generated on every run, so that results are reproducible.
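As a rough illustration of a seeded 80:20 split, the Python sketch below shuffles row indices with a fixed seed and slices off the test portion. The function name, the default seed of 42 and the row count are assumptions for the example. The notebook performs this step in q.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) from a seeded shuffle of row indices."""
    rng = random.Random(seed)      # local generator, global state untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    # First n_test shuffled indices form the test set, the rest the train set
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same indices, which is what makes later results reproducible.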

Looking at the distribution of the target features in the y_train set:

Within the target attribute exists a Converted state, which signifies an individual being diagnosed with mild cognitive impairment on a follow-up visit, having initially shown no cognitive impairment.

To keep this within the realms of a binary classification problem, i.e. whether an individual is exhibiting early symptoms of Dementia or not, the Converted values will be transformed to Demented values.

This step will be executed prior to evaluating machine learning algorithm performance.


Exploratory data analysis

By visualising data graphically, it is possible to gain important insights that would not be as obvious through eyeballing the data. Exploratory data analysis (EDA) is the practice of describing data visually, through statistical and graphical means, to bring important patterns of that data into focus for further investigation.

There are two main plot libraries in python to aid with visual analysis:

The seaborn library is preferred throughout this whitepaper mainly due to its aesthetics and ability to visualise many different features at once.

To remain honest, test sets are not used during visual analysis.

To avoid any contamination with training data, a copy of the training data is created which will contain both X_train and y_train tables, joined together using the each-both adverb. The train_copy is deleted from memory after EDA:

Init graph configuration settings:

Many graphs in seaborn use the hue variable as a setting parameter during plotting for colour encoding. In this example, hue is the target attribute state. As a result, seaborn will colour the datapoints differently for each distinct state value (Nondemented / Demented / Converted). In the below example, factor plots portraying the distribution of nondemented and demented individuals by gender are coloured differently - light green signifies the nondemented state, dark green the demented state and turquoise the converted patients.

Initial Assumptions:

Next, a count plot is used to display the number of distinct values per column (similar to plotting histograms).

A list of columns is split pairwise using the custom splitList function. Each column pair is then graphically displayed on the same axis:

A few things can be concluded from these graphs about the dataset:

Facet grid plots are then computed to portray the variation of Alzheimer's as a function of etiv, educ, ses, nwbv, mmse and asf:

The next stage of data visualization involves plotting FacetGrid charts to view the relationship of Demented, Nondemented and Converted diagnoses against a range of different features:

  1. mmse
  2. ses
  3. nwbv
  4. etiv
  5. educ
  6. age

A few observations:

Plotting pairplots to visualise whether relationships exist among genders:

Data pre-processing

Extracting numerical and categorical features

Splitting the training columns, by datatype, into their own group:

Logarithmic transform

As shown in the describe table there are some fields which are highly skewed and thus have the potential to alter performance. To circumvent this, a log-transform will be applied to each of the skewed columns:
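A minimal Python sketch of such a transform is given below. It uses log(1 + x) rather than a plain logarithm so that zero values remain defined, which is an assumption for the example and not necessarily what the notebook's q code does. The input values are made up.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; assumes every x is greater than -1."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed values spanning several orders of magnitude
skewed = [0, 10, 100, 1000, 10000]
transformed = log_transform(skewed)
print(transformed)
```

The ordering of the values is preserved, but the long right tail is compressed into a much narrower range.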

Categorical encoding

get_dummies method

A pivot table is computed to encode the categorical state variables - Converted, Demented, Nondemented - into binary 0 | 1 numeric values.

A left join (lj) is then used in dummies to replace the Converted, Demented and Nondemented values with the numeric values in the pivot table. The output is assigned as a global dummy table:

Correlation between features

A correlation matrix is computed to gauge if any inherent relationships exist between the features and the target variable. The target state feature is converted into a dummy/indicator numeric variable to convey which attributes share the strongest correlation with it:

The function corrPairs can also be called to display the x most positively correlated feature pairs:

From the correlation matrix, which depicts the Pearson's r (correlation coefficient) between two features, some assumptions can be made:

One-hot encoding

One-hot encoding is used to transform/encode categorical values into separate binary values. The term one-hot signifies that, for each row, exactly one value in the list is 1 (hot) whilst the remaining values are 0 (cold).

This is desired as most machine learning algorithms perform better when dealing with numerical rather than categorical values.
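As an illustration of the idea, the following Python sketch one-hot encodes a small list of categorical values. The function name one_hot and the sample values are assumptions for the example. The notebook's own encoding is done in q.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per category."""
    categories = sorted(set(values))
    # Each row has a 1 in the position of its own category and 0 elsewhere
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, encoded = one_hot(["M", "F", "F", "M"])
print(categories, encoded)
```

Every encoded row sums to 1, which is the "one hot value" property described above.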

Thus, the mF categorical values are one-hot encoded:

Imputation

Handling missing features within machine learning is paramount. For various reasons, many real-world data sets come with missing values, often encoded as blanks, NaNs, or other placeholders. Many algorithms cannot work with missing features and thus they must be dealt with. There are various options to combat this:

The function impute takes a table reference and a method (med, avg, mode) with which to infer the missing values.
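A rough Python analogue of this idea is sketched below for a single column held as a list, with None standing in for a missing value. The name impute_column and the list-based interface are assumptions made for the example. The notebook's impute function operates on a kdb+ table reference and is not reproduced here.

```python
from statistics import mean, median, mode

# Method names mirror those mentioned in the text: med, avg, mode
METHODS = {"med": median, "avg": mean, "mode": mode}

def impute_column(values, method="med"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = METHODS[method](observed)
    return [fill if v is None else v for v in values]

filled = impute_column([2, None, 4, 3, None, 1], "med")
print(filled)
```

In a train/test setting the fill value would be computed on the training column only and then reused for unseen data.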

Replacing missing features with the column's median value is executed below:

The values within the ses column are reversed to convey the correct information to the model.

Currently, a low ses value (1 or 2) represents a high economic status whilst a high ses value (4 or 5) depicts a low economic status.

Values are reversed (1->5, 2->4, 4->2, 5->1) so that the higher the score, the higher the income an individual received:
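On a 1 to 5 scale this reversal is equivalent to subtracting each value from 6. A minimal Python sketch, with a hypothetical helper name and made-up input, is:

```python
def reverse_scale(values, low=1, high=5):
    """Mirror an ordinal scale so that low becomes high and vice versa."""
    return [low + high - v for v in values]

reversed_ses = reverse_scale([1, 2, 3, 4, 5])
print(reversed_ses)
```

The midpoint 3 maps to itself, which is why it does not appear in the mapping listed above.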

Investigating and dealing with outliers

An outlier table is defined to track any outlier values and their indices:

The outlierDetect function uses the z-score method to flag any datapoint with a standard score greater than a threshold value of 3.

The z-score defines how many standard deviations a datapoint is away from the mean.
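The z-score method can be sketched in Python as follows. This is an illustration and not the notebook's outlierDetect function: the helper name, the use of the absolute z-score, the population standard deviation and the sample data are all assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)    # population standard deviation, must be non-zero
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical column: nineteen typical readings and one extreme value
readings = [10] * 19 + [100]
outliers = zscore_outliers(readings)
print(outliers)
```

Note that an extreme value also inflates the mean and standard deviation it is measured against, so on very small samples a single outlier may fail to exceed the threshold.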

Anomalous values are now persisted within the global outliers table:

These outliers can be visualised by graphing whiskerplots:

NOTE: Outliers can also be visualised in 3D space using the Isolation Forest technique. Please refer to the appendix: Using Isolation Forest to view outliers in 3D space

Dealing with outliers is imperative in machine learning as they can significantly influence the data and thus add extra bias during evaluation. Some possible solutions to circumvent the effect of outliers:

  1. Keep the outliers in the dataset.
    • If the dataset is small, which it is in this case, it might be more costly to remove any rows as information about the variability in the data is lost **9**.
  2. Remove the outlier rows from the training data.
  3. Apply a log-transform to each outlier.
  4. Clip each outlier to a given value such as the 5th or 95th percentile - a process called Winsorization **9**.
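Option 4 can be sketched in Python as below. The helper name, the percentile interpolation rule and the sample data are assumptions for illustration. The notebook applies its own outlierTransform function in q.

```python
from statistics import quantiles

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles."""
    # n=100 yields 99 cut points; cut point i is stored at index i - 1
    cuts = quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]

# Hypothetical column holding the integers 0 to 100
clipped = winsorize(list(range(101)))
print(clipped[0], clipped[50], clipped[-1])
```

Unlike row removal, clipping keeps every observation, which matters when the dataset is small.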

The outlierTransform function is called below with the winsorize transform, so that each low outlier is replaced with the 5th percentile value and each high outlier with the 95th percentile value:

Gathering training statistics

A training statistics table, trainingStats, is defined inside mlf.q to capture important metrics such as:

These metrics will be used to transform unseen test data:

Drop irrelevant & collinear features

Collinear and irrelevant features are removed from the dataset:

Standardising or Normalising

In machine learning it is generally a requirement that for algorithms to perform optimally, features should be standardised or normalised.

When standardising a dataset, features are rescaled so that they have the properties of a standard normal distribution with

$$μ = 0$$

$$σ = 1$$

where μ is the mean (average) and σ is the standard deviation from the mean i.e. they are centered around 0 with a std of 1 **10**.

The z-scores of each datapoint can then be computed using:

$$z = \frac{x-μ }{σ }$$

This can be written simply in q as: `stdScaler:{(x-avg x)%dev x}`

An alternative approach is to use normalisation (often called min-max scaling).

When using this method, data is scaled to a fixed range - usually 0 to 1. The cost of having this fixed range is that we will end up with smaller standard deviations, which can suppress the effect of outliers **10**.

Min-Max scaling is performed using the following equation:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

This can also be written in q as: `minMaxScaler:{(x-m)%max[x]-m:min x}`

There's no obvious answer when choosing between standardisation and normalisation.

Both scaling transforms are executed on the training split and their outputs are visualised below:

Based on the above graphs, standardisation is chosen over normalisation.

This scaler is then permanently applied to the training dataset via a functional apply:

Next, several functions are added to the trainingStats table so that transformations can be easily reproduced on any dataset. These functions:

Each function is lightweight - a projection with computed training max, min, median, mean or standard deviation values.
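The same pattern can be illustrated in Python with closures that capture training statistics and are later applied to unseen values. The function names and data below are hypothetical, and the notebook stores q projections in its trainingStats table instead.

```python
from statistics import mean, pstdev

def fit_std_scaler(train_values):
    """Return a function that standardises with the training mean and deviation."""
    mu, sigma = mean(train_values), pstdev(train_values)
    return lambda values: [(v - mu) / sigma for v in values]

def fit_min_max_scaler(train_values):
    """Return a function that rescales with the training min and max."""
    lo, hi = min(train_values), max(train_values)
    return lambda values: [(v - lo) / (hi - lo) for v in values]

train = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical training column
std_scale = fit_std_scaler(train)
min_max_scale = fit_min_max_scaler(train)

unseen = [5, 11]                    # hypothetical unseen values
print(std_scale(unseen), min_max_scale(unseen))
```

Because the statistics are frozen at fit time, unseen values never influence them. An unseen value outside the training range can therefore fall outside 0 to 1 under min-max scaling.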

This means transforming unseen data becomes relatively simple:

For example, unseen visit values can be standardised, normalised or imputed referencing the below projections:

Encode target group to numerical values

Since this is a binary classification problem, the Converted values in the y_train and y_test datasets are replaced with Demented. In addition, these datasets are encoded, using a vector conditional, into numerical values where:

0 = Nondemented

1 = Demented
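A minimal Python sketch of this substitution and encoding, using made-up labels and a hypothetical helper name, is:

```python
from collections import Counter

def encode_target(states):
    """Map Converted to Demented, then encode Demented as 1 and Nondemented as 0."""
    merged = ["Demented" if s == "Converted" else s for s in states]
    return [1 if s == "Demented" else 0 for s in merged]

# Hypothetical labels, for illustration only
encoded_target = encode_target(["Nondemented", "Demented", "Converted", "Nondemented"])
class_counts = Counter(encoded_target)
print(encoded_target, class_counts)
```

Counting the encoded labels gives the class distribution that is checked in the next step.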

After encoding the target classes, the distribution of classes in the target dataset is checked:

Overall, there are 12 more 0 than 1 classes.

Using a machine learning estimator out of the box when classes aren't evenly distributed can be problematic.

To address this imbalanced class issue, new examples of the minority 1 class will be synthesised using the SMOTE technique.

SMOTE (Synthetic Minority Over-sampling Technique) connects the dots between minority class samples and, along these connections, creates new synthetic minority samples.

This technique is applied after the kdb+ tables are transformed into python arrays.
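The interpolation at the heart of SMOTE can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the helper name, the neighbour count and the points are hypothetical, and it is not the library implementation used by the notebook.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # k nearest minority neighbours of base by squared Euclidean distance
    neighbours = sorted(
        others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base))
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()    # position along the segment from base to neighbour
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class points in two dimensions
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(1))
print(synthetic)
```

Each synthetic point lies on the line segment joining two existing minority samples, so it always stays within the region those samples already occupy.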

Shuffle training columns

The order of columns is shuffled pre-evaluation:

Pipe transform

There are many transformation steps that need to be executed (in the correct order) before the data can be evaluated by machine learning algorithms.

In Scikit-learn, the Pipeline class is usually used to chain sequences of transformations.

To help automate the above machine learning workflow, feature engineering and data cleaning steps are grouped into a pipeline function:

Similar to the X_train dataset, the order of columns is shuffled pre-evaluation:

The kdb+ tables are then transformed into Python-readable matrices:

Note, to perform the reverse transformation, i.e. a Python array to a kdb+ table, run the array2Tab function.

Now, as alluded to already, the class imbalance problem is addressed using the SMOTE technique to generate some minority 1 classes:

Evaluation

The No Free Lunch theorem in Machine Learning stipulates that there is no single algorithm which works perfectly for all datasets **7**. Thus, the performance of different machine learning classifiers will be computed and compared.

The main evaluation metric to gauge algorithm performance is the AUC (Area under ROC curve) score.

The ROC curve displays the trade-off between the true positive rate and the false positive rate [6]. In the case of diagnosing Dementia, it’s imperative that patients who exhibit symptoms are identified as early as possible (high true positive rate) whilst healthy patients aren’t misdiagnosed with Dementia and started on treatment (low false positive rate). AUC is the most appropriate performance measure as it summarises how well a model distinguishes between the two diagnostic groups (Demented / Nondemented) [7].
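One way to make the AUC concrete is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The Python sketch below computes it that way on made-up labels and scores. It is an illustration, not the evaluation code used in the notebook.

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A correctly ordered pair scores 1, a tie 0.5, a misordered pair 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two groups.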

Other evaluation metrics are used to complement the AUC score but don’t carry the same weight.

These include:

  1. Cross-Validation score (or gridsearch score).
  2. Recall score - the proportion of actual positive instances that each model correctly detects.
  3. Diagnostic odds ratio (DOR score) - the odds of a positive result in individuals with an illness relative to the odds of a positive result in individuals without the illness **11**.
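Recall and the diagnostic odds ratio can both be derived from confusion-matrix counts, as the Python sketch below shows. The helper names and the labels are hypothetical, and the notebook's classReport function is not reproduced here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def recall_score(tp, fn):
    """Proportion of actual positives that were detected."""
    return tp / (tp + fn)

def diagnostic_odds_ratio(tp, fp, fn, tn):
    """(tp/fn) / (fp/tn); undefined when fp or fn is zero."""
    return (tp * tn) / (fp * fn)

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = recall_score(tp, fn)
dor = diagnostic_odds_ratio(tp, fp, fn, tn)
print(recall, dor)
```

A DOR of 1 means the classifier's positive results are no more likely in ill than in healthy individuals, and larger values indicate better discrimination.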

The evaluateAlgos function calls classReport which computes all of the above evaluation metrics - AUC scores, classification reports (containing recall score) and a DOR score. This function will be called for a series of machine learning algorithms.

Before evaluation, a global scores table is defined that will record important metrics used to evaluate the validity of each model, along with whether that model is using default parameters or hyperparameters that have been tuned via grid search or randomized grid search:

Note:

Linear classifiers are evaluated using default parameters first:

Next, non-linear classifiers with default parameters:

Finally, tree based classifiers:

First off, most models with default parameters provided a significant improvement over the baseline model, indicating that machine learning techniques are applicable to this dataset.

It is apparent that all models suffer from overfitting, a consequence of having a small dataset: they score strongly on training predictions but generalise poorly to unseen data.

Please refer to the Overfitting in SVM section in the appendix to see the effects of overfitting.

The next steps will try and circumvent this issue by:

Using the Boruta Algorithm to remove irrelevant features and only retain features that fall within an area of absolute acceptance.

Using the GridSearch and RandomizedGridSearch techniques with cross validation to finely tune hyperparameters that could unknowingly exacerbate overfitting.

Feature selection

Convert the arrays back into q tables:

Feature selection is the process of finding a subset of features in the dataset X which have the greatest discriminatory power with respect to the target variable y.

If feature selection is ignored:

Ideally, instead of manually going through each feature to decide if any relationship exists between it and the target, an algorithm that is able to autonomously decide whether any given feature of X bears some predictive value about y is desired.

This is what the Boruta algorithm does.

A Random Forest algorithm is fitted on X and y. The feature importance is extracted from the RF model and only features that are above a threshold of importance are retained **12**. Digging deeper:

This algorithm can be written in q as shown below.

The iteration count is arbitrary. The user provides a list of integers where:

So in the below case, 80 runs are executed against the training dataset X_train with the random seed value iterating each run (starting at 1, finishing at 80). The user decides how many features they extract from the area of acceptance (3 in this case):

Important features [nwbv mmse educ] are extracted and kept using the Boruta Algorithm.

The remaining features are dropped:

Train and test sets are converted back to python arrays:

Hyperparameter Tuning

In order to further improve the AUC score for each model, the hyperparameters for each classifier are optimized using one of the following techniques:

  1. GridSearch
  2. RandomizedSearchCV

GridSearch simplifies hyperparameter implementation & optimization. A dictionary containing the hyperparameters to be experimented with is passed to a GridSearchCV function, which will evaluate all possible combinations of hyperparameter values using cross validation on a model of the user's choosing **7**.

GridSearch is ideal when the combinations that you wish to explore aren’t plentiful, but whenever the hyperparameter space is large, it is preferable to make use of RandomizedSearchCV. This class works much like GridSearch but with one major difference – instead of experimenting on all possible parameter combinations, randomized search will evaluate a fixed number of hyperparameter combinations sampled from specified probability distributions **13**. It is generally preferred over GridSearch as the user has more control over the computational budget by setting the number of iterations **7**.
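The difference in search budget can be illustrated with a short Python sketch that only enumerates candidate settings and does not fit any model. The parameter values and helper names are hypothetical. Unlike scikit-learn's RandomizedSearchCV, this simplified sampler draws each parameter independently and may repeat a combination.

```python
import itertools
import random

# Hypothetical search space for an RBF-kernel SVM
param_space = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

def grid_candidates(space):
    """Every combination of the listed values (exhaustive grid)."""
    keys = list(space)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(space[k] for k in keys))]

def random_candidates(space, n_iter, seed=0):
    """A fixed number of randomly drawn combinations."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()}
            for _ in range(n_iter)]

grid = grid_candidates(param_space)
sampled = random_candidates(param_space, n_iter=5)
print(len(grid), len(sampled))
```

The grid grows multiplicatively with every added parameter, whereas the randomized budget stays at whatever n_iter is set to.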

An optimalModels keyed table is defined that will tabulate the optimal parameters found using the grid-search/randomized grid-search technique for a particular classifier.

An optimal model will then be used by a web application to predict whether an individual is displaying Alzheimer's symptoms.

A series of dictionaries are defined that represent the parameter space for each algorithm:

A dictionary is defined to map each parameter space to its algorithm:

All models tuned via RandomizedSearchCV achieved higher AUC scores than in the previous untuned evaluation, an indication that hyperparameter tuning coupled with feature selection improved performance significantly (the closer the AUC score is to 1, the better):

The model with the highest AUC score is the SVM algorithm.

The SVM model is relatively simple compared to the boosting & ensemble algorithms, which suggests that these results reinforce the principle of Occam's razor: given a small dataset, the application of simple models, with the fewest assumptions, yields the best results **7**.

Although there is still some overfitting happening (training acc > test acc), mainly due to the size of this dataset, it is not as consequential as previously - the training accuracies have decreased significantly whilst the test accuracies have risen substantially.

In essence, the models aren't learning as many particulars of the training dataset and are therefore generalising better on unseen data.

Previously, models were learning details and noise in the training data to the extent that they were generalising very poorly on unseen data.

Set the variable modelsTrained to true signalling to the GUI that the models have been trained and are ready to predict Dementia scores:

Creating a web GUI to predict unseen cases

Since WebSockets are supported by most modern browsers, it is possible to create an HTML5 real-time web application (GUI) and connect it to this kdb+ process through JavaScript.

A detailed introduction to websockets can be found in the Kx whitepaper Kdb+ and WebSockets: https://code.kx.com/q/wp/websockets/.

This simple GUI will serve as the entry-point for a potential client who wants to ascertain, by entering some data input, if an individual is exhibiting any symptoms of dementia.

Customising .z.ws

Firstly, make sure some arbitrary port is open so the web application can connect (should be port 9090 by default):

kdb+ has a separate message handler for WebSockets called .z.ws, meaning all incoming WebSocket messages will be processed by this function.

Note that there is no default definition of .z.ws. In this case it has been customised to take biomarker/MRI data and to return a result that indicates how likely it is that an individual is Demented (a value above 0.5 - the closer to 1, the greater the confidence that a subject is demented).

Breaking down each step of this custom .z.ws function:

Refer to the README - Using the web GUI for example usage

Conclusion

This notebook demonstrates that a kdb+ approach can be applied to a machine learning classification task to produce logical results. A multitude of techniques were reproduced in q to replicate their python equivalent. These included data wrangling & exploratory data analysis steps, feature engineering techniques and evaluation metrics (AUC score, classification reports etc.) to gauge model performance.

Given the dataset was small, models trained on this set ran the risk of overfitting, where models were more susceptible to seeing patterns that didn't exist, resulting in high variance and poor generalisation on a test dataset.

Thus, to circumvent the effect of overfitting, outliers were transformed and cross-validation techniques were applied.

Initially, model performance was poor with simple models such as LinearDiscriminantAnalysis and LogisticRegression winning out (complex models with many parameters increased the likelihood of overfitting).

To improve performance, feature selection coupled with grid search & RandomizedSearchCV techniques resulted in AUC scores increasing by 20% across the board, with the best performing algorithm being the Support Vector Machine model.

Finally, an HTML5 web application was created to serve as an entry-point for a potential user who wants to ascertain, by entering some data fields, if an individual is showing any symptoms of dementia.

Appendix

Overfitting in SVM

To exhibit the effects of parameter values influencing overfitting, the iris dataset, one of the most renowned machine learning datasets, is used to demonstrate tuning the hyperparameters of a support vector machine.

The hyperparameters are:

The iris dataset is imported from sklearn's internal datasets library:

The plotSVC function allows decision boundaries in SVMs to be visualised for different parameter values:

An overfit function can be called to iterate over each hyperparameter value to exhibit the effects that different values have on parameters.

The parameters investigated are:

N.B. The effects of the C and gamma parameters are studied with a radial basis function (RBF) kernel applied to an SVM classifier.

Looking at the decision boundaries for different kernels:

Decision boundaries for different regularization values (C):

Finally, comparing boundaries for different gamma values:

Using Isolation Forest to view outliers in 3D space

Another method of detecting anomalies is using the Isolation Forest algorithm. It isolates outliers by selecting a random feature and then computing a split value between the max and min values of the said feature. This random partitioning results in shorter paths from anomalous data points which makes them distinguishable from other data points.

These outliers are then visualised in 3D space to further reinforce the influence these datapoints can exert on the dataset:

Machine Learning algorithms

Pros vs Cons

Logistic Regression:

| Model | Pros | Cons |
|---|---|---|
| Logistic Regression | Due to its simplistic nature and easy implementation, it is an ideal baseline for any binary classification problem. | This model is prone to overfitting and doesn’t fare well on independent variables that are in no way correlated with the target variable. |
| | Following the reasoning of Occam’s Razor, given the size of the MRI dataset, the application of simple models could yield the best results. | |
| | It also doesn’t require much computational power and doesn’t require the scaling of features (quicker fitting time). | |

Support vector machines:

| Model | Pros | Cons |
|---|---|---|
| SVM | Can handle the case where the relationship between the features and the target is non-linear (kernel trick). | Memory intensive. |
| | Has few hyperparameters to tune - in the case of this dataset, C and gamma will be the hyperparameters that need tuning as the ‘rbf’ kernel will be used. | Fitting times can be long. |

Naive Bayes:

| Model | Pros | Cons |
|---|---|---|
| Naive Bayes | The training set is small and thus high bias/low variance classifiers (i.e. Naïve Bayes) should have an advantage over complex models which may have a tendency to overfit. | Has trouble learning the interaction between different features. |
| | Extremely simple to implement. | |

Decision Trees:

| Model | Pros | Cons |
|---|---|---|
| Decision Trees | Decision trees require very little data preparation i.e. don’t require feature scaling. They are the fundamental concept behind the Random Forest model. | The main disadvantage is an increase in variance, which leads to poor generalization (tendency to overfit). |
| | Decision trees are fairly intuitive and thus their decisions are easy to interpret i.e. they provide simple classification rules that can be applied manually if need be (known as ‘White box’ modelling). | Small variations in data can result in different decision trees. |

Random Forests:

| Model | Pros | Cons |
|---|---|---|
| Random Forests | The random forest model is commonly referred to as the ‘Leatherman’ of learning methods and thus can be fitted to most regression and classification tasks. [20] | ‘Black box’ model that is very hard to interpret (in comparison to decision trees). |
| | Although no algorithm is immune to overfitting, and explicit efforts to avoid it (cross validation etc.) should still be undertaken, RFs are less likely to overfit. | A large number of trees may slow the model down when making predictions. |
| | It can handle a large number of features and can help estimate which features are particularly important in the underlying data. | |

Adaboost:

| Model | Pros | Cons |
|---|---|---|
| Adaboost | It's simple to implement. | Can be sensitive to noisy data or data which contains outliers. |
| | Not overly prone to overfitting. | |

Gradient Boosting:

| Model | Pros | Cons |
|---|---|---|
| Gradient Boosting | Can produce high prediction accuracies. | Computationally expensive. |
| | It can work on datasets that have missing features. | It can cause overfitting due to the model trying to continually minimise all errors. |
| | Few preprocessing steps need to be implemented as it can handle both numerical and categorical data. | |

Sources

Sources can be found in the accompanying README.md file