We often think of most distributions as having one mode. This includes distributions such as the normal distribution, which is a standard reference in statistics.

When working with a table in SQL, it is often the case that one might wish to aggregate values, or calculate a running total among the values in the table.

In this article, we will investigate how this can be done using what is called a **window function**.

Additionally, we…

When working with a dataset, it is often the case that the data is not in the necessary format to conduct the appropriate analysis.

For instance, what if we wish to conduct a time series forecast, but there exist many data points over the same time period?

The purpose of Principal Component Analysis (PCA) is to identify the features that demonstrate the largest amount of variance in a training set.

This is used as a feature selection method to identify the most important attributes that influence the outcome variable — thus allowing for the discarding of variables…

Many of the common queries that we learn in SQL such as GROUP BY are typically used with analysing one table in isolation. It is also common to use a JOIN clause in joining two tables together and treating them as one.

However, there will often be instances where one…

Traditional linear regression can prove to have some shortcomings when it comes to handling outliers in a set of data.

Specifically, if a data point lies very far away from other points in the set — this can significantly influence the least squares regression line, i.e. …

A **view** is a virtual table in SQL that functions similarly to a standard table, but does not need to be physically stored, i.e. it is only stored in memory and does not take up actual storage space.

For instance, there will often be times when one would like to…

Linear regression is among the most frequently used — and most useful — modelling tool.

While no form of regression analysis can ever approximate reality, it can do quite a good job at both making predictions for the dependent variable and determining the extent to which each independent variable impacts…

The primary purpose of using an ANOVA (Analysis of Variance) model is to determine whether differences in means exist across groups.

While a t-test is capable of establishing if differences exist across two means — a more extensive test is necessary if several groups exist.

In this example, we will…

When modelling a time series with a model such as ARIMA, we often pay careful attention to factors such as seasonality, trend, the appropriate time periods to use, among other factors.

However, when it comes to using a machine learning model such as XGBoost to forecast a time series —…

Data Science Consultant with expertise in economics, time series analysis, and Bayesian methods | michael-grogan.com