Analysing distributions with more than one mode

We often think of most distributions as having one mode. This includes distributions such as the normal distribution, which is a standard reference in statistics.


Calculate running totals with window functions

When working with a table in SQL, one often wishes to aggregate values or calculate a running total across the values in the table.

In this article, we will investigate how this can be done using what is called a window function.

Additionally, we…


How to group data using Python and SQL

When working with a dataset, it is often the case that the data is not in the necessary format to conduct the appropriate analysis.

For instance, what if we wish to conduct a time series forecast, but there exist many data points over the same time period?

In this example…


Is one feature selection method better than the other?

The purpose of Principal Component Analysis (PCA) is to identify the directions (linear combinations of features) that capture the largest amount of variance in a training set.

This is used as a feature selection method to identify the most important attributes that influence the outcome variable — thus allowing for the discarding of variables…


Analysing results from multiple tables

Many of the common queries that we learn in SQL, such as GROUP BY, are typically used when analysing one table in isolation. It is also common to use a JOIN clause to join two tables together and treat them as one.

However, there will often be instances where one…


How to handle outliers in a dataset

Traditional linear regression can prove to have some shortcomings when it comes to handling outliers in a set of data.

Specifically, if a data point lies very far away from the other points in the set, it can significantly influence the least squares regression line, i.e. …


How views can aid analysis across SQL databases

A view is a virtual table in SQL that functions similarly to a standard table, but its rows do not need to be physically stored: only the defining query is saved, and the results are computed whenever the view is queried.

For instance, there will often be times when one would like to…


Using Bayesian Linear Regression to account for uncertainty

Linear regression is among the most frequently used, and most useful, modelling tools.

While no form of regression analysis can ever perfectly capture reality, it can do quite a good job at both making predictions for the dependent variable and determining the extent to which each independent variable impacts…


Determining sales differences across groups

The primary purpose of using an ANOVA (Analysis of Variance) model is to determine whether differences in means exist across groups.

While a t-test can establish whether the means of two groups differ, a more extensive test is necessary when there are several groups.

In this example, we will…


Forecasting techniques don’t work well with all time series

When modelling a time series with a model such as ARIMA, we often pay careful attention to factors such as seasonality, trend, and the appropriate time periods to use.

However, when it comes to using a machine learning model such as XGBoost to forecast a time series —…

Michael Grogan

Data Science Consultant with expertise in economics, time series analysis, and Bayesian methods | michael-grogan.com
