Customer segmentation is a key consideration for any business.

While it may be tempting to look at sales data in isolation, different customer segments have different spending patterns, which means that sales figures can vary widely across groups.

In this regard, using a standard linear regression to quantify the impact of different features on sales can be misleading, as distinct groups can **exist within each feature**, each affecting sales in a different way. Therefore, a mechanism is needed to model such structured linear relationships appropriately.
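As a hedged illustration of this point (the segment labels, slopes, and sample sizes below are entirely synthetic, invented for this sketch), fitting a single regression across two customer segments with different spending behaviour can produce a slope that represents neither group, whereas fitting per-segment regressions recovers each group's relationship:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: spend response differs by segment.
x_a = rng.uniform(0, 10, 500)
y_a = 5 + 2.0 * x_a + rng.normal(0, 1, 500)   # segment A: steep slope
x_b = rng.uniform(0, 10, 500)
y_b = 40 + 0.2 * x_b + rng.normal(0, 1, 500)  # segment B: nearly flat

# A pooled fit ignores the segment structure entirely
pooled_slope = np.polyfit(np.concatenate([x_a, x_b]),
                          np.concatenate([y_a, y_b]), 1)[0]

# Per-segment fits recover the true relationships
slope_a = np.polyfit(x_a, y_a, 1)[0]
slope_b = np.polyfit(x_b, y_b, 1)[0]
```

The pooled slope lands somewhere between the two true slopes and describes neither segment well, which is exactly the risk of ignoring group structure.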

There is increasing emphasis on **interpretable machine learning** in the world of data.

Models have grown ever more complex as neural networks have become mainstream, and as the sheer size of the data being analysed has increased.

In many cases, such complex models may not be fit for human interpretation in their own right. Therefore, there has been a push to make models interpretable, whereby both the results and the process of achieving those results can be understood by humans.

In my view, a shortcoming of interpretable machine learning is that it assumes to a degree that the…

*Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice. The findings and interpretations in this article are solely those of the author and are not endorsed by or affiliated with any third-party mentioned in this article. This article is not intended to promote any particular company or product.*

In February 2020, Forrester predicted that as a result of COVID-19, demand for hardware, including computer and communications equipment, would be quite weak due to…

The process of generating random numbers from a particular distribution can be automated to a large extent.

For instance, if one wants to generate 100 random numbers that belong to a normal distribution in R, it is as simple as executing:

`rnorm(100)`

However, how does this process actually work “under the hood”? How can an algorithm know whether a random number belongs to a particular distribution or not?

The answer is through **rejection sampling**.

Rejection sampling is a means of generating random numbers that belong to a particular distribution.
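As a minimal sketch of the idea (the proposal interval and sample size below are arbitrary choices, not part of the original article), one can draw candidates from a uniform proposal and keep only those that fall under the target density curve, here a standard normal:

```python
import numpy as np

rng = np.random.default_rng(42)

def normal_pdf(x):
    # Standard normal density
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def rejection_sample(n, lo=-5.0, hi=5.0):
    samples = []
    m = normal_pdf(0.0)  # peak of the target density
    while len(samples) < n:
        x = rng.uniform(lo, hi)   # candidate from the uniform proposal
        u = rng.uniform(0.0, m)   # vertical "dart" height
        if u < normal_pdf(x):     # point falls under the curve: accept
            samples.append(x)
    return np.array(samples)

draws = rejection_sample(1000)
```

The accepted draws behave like samples from the normal distribution itself, even though every candidate started life as a plain uniform variate.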

A Cartesian graph consists of x and y-axes across a defined…

The primary purpose of Bayesian analysis is to model data given uncertainty.

Since one cannot access all the data about a population to determine its precise distribution, assumptions about that distribution are often made.

For instance, I might make an assumption regarding the mean height of a population in a particular country. This is a **prior distribution**, or a distribution that is founded on prior beliefs before looking at data that could prove or disprove that belief.

Upon analysing a new set of data (a likelihood function), prior beliefs and the likelihood function can then be combined to form the…
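To make the prior-plus-likelihood idea concrete, here is a hedged sketch using the height example, with a conjugate normal model; the prior mean, standard deviations, and observations are all hypothetical figures invented for illustration:

```python
import numpy as np

# Hypothetical figures for illustration only.
prior_mean, prior_sd = 170.0, 10.0   # prior belief about mean height (cm)
sigma = 7.0                          # assumed known population sd
heights = np.array([175.0, 172.0, 168.0, 180.0, 177.0])  # observed data

n = len(heights)
prior_prec = 1 / prior_sd**2
data_prec = n / sigma**2

# Conjugate normal update: the posterior precision is the sum of
# precisions, and the posterior mean is the precision-weighted
# average of the prior mean and the sample mean.
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * heights.mean()) / post_prec
post_sd = np.sqrt(1 / post_prec)
```

The posterior mean sits between the prior belief and the sample mean, pulled towards whichever carries more precision, and the posterior is narrower than the prior, reflecting what the data added.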

**Rejection sampling** is a means of generating random numbers that belong to a particular distribution.

For instance, let’s say that one wishes to generate 1,000 random numbers that follow a normal distribution. If one wishes to do this in Python using numpy (imported as `np`), it takes a single line:

`np.random.randn(1000)`

However, how exactly does this process work? Upon generating random numbers in Python, how can an algorithm know whether a random number belongs to a particular distribution or not? This is where rejection sampling comes in.

There is a reason I provided an image of a darts board at the beginning…
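The darts intuition can be sketched as follows (the board size and dart count are arbitrary): throw darts uniformly at a square board and count how many land inside the inscribed circle. The acceptance rate recovers the circle's area, which is the same accept/reject logic that rejection sampling relies on:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Throw darts uniformly at a 2x2 board centred on the origin
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)

# Accept darts that land inside the unit circle
hits = (x**2 + y**2 <= 1).sum()

# Acceptance fraction times the board area (4) estimates the
# circle's area, i.e. pi
pi_estimate = 4 * hits / n
```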

A power law distribution (such as a Pareto distribution) describes the 80/20 rule that governs many phenomena around us.

For instance:

- 80% of a company’s sales often comes from 20% of their customers
- 80% of a computer’s storage space is often taken up by 20% of the files
- 80% of the wealth in a country is owned by 20% of the people

These are just a few examples. While many believe that most datasets tend to follow a normal distribution, power law distributions tend to be a lot more common than we realise. …
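This can be checked empirically with a quick simulation (the shape parameter below, roughly 1.16, is the value at which a Pareto distribution theoretically yields an 80/20 split; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.16        # shape parameter implied by the 80/20 rule
n = 100_000

# np.random.Generator.pareto draws from a Lomax distribution;
# adding 1 shifts it to a classical Pareto with minimum value 1
wealth = rng.pareto(alpha, n) + 1

wealth.sort()
top20_share = wealth[int(0.8 * n):].sum() / wealth.sum()
```

Because the distribution is heavy-tailed, the sample share fluctuates around the theoretical 80% figure, sometimes substantially, but the concentration itself is unmistakable.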

I often like to play chess and minesweeper in my spare time (yes, don’t laugh).

Of these two games, I have always found minesweeper more difficult to understand, and the rules of play have always seemed very opaque.

However, minesweeper much more closely resembles how situations often unfold in the real world. Here is why that is relevant to data science.

Compare that to chess, where regardless of one’s playing ability, all players have **perfect information** at all times.

One can always see every piece on the board, and neither opponent possesses any informational advantage…

Tools such as Python or R are most often used to conduct deep time series analysis.

However, knowledge of how to work with time series data using SQL is essential, particularly when working with very large datasets or data that is constantly being updated.

Here are some useful commands that can be invoked in SQL to better work with time series data within the data table itself.

In this example, we are going to work with weather data collected across a range of different times and locations.

The data types in the table of the PostgreSQL database are as below:

`…`
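Since the schema above is elided, here is a hedged, self-contained sketch of the kind of query involved, using a hypothetical weather table with invented column names. SQLite is used so the snippet runs anywhere; in PostgreSQL, `date_trunc('day', ...)` would be the idiomatic equivalent of the `date()` call below:

```python
import sqlite3

# Hypothetical weather table; columns are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE weather (
        recorded_at TEXT,   -- ISO-8601 timestamp
        location    TEXT,
        temp_c      REAL
    )
""")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("2020-01-01 09:00:00", "Dublin", 6.5),
        ("2020-01-01 15:00:00", "Dublin", 9.0),
        ("2020-01-02 09:00:00", "Dublin", 5.0),
        ("2020-01-02 15:00:00", "Dublin", 8.0),
    ],
)

# Bucket the timestamps by day and compute the daily average
# temperature per location, entirely within the database
rows = conn.execute("""
    SELECT date(recorded_at) AS day, location, AVG(temp_c) AS avg_temp
    FROM weather
    GROUP BY day, location
    ORDER BY day
""").fetchall()
```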

For someone who originally comes from an economics background, it might seem quite strange that I would spend some time building models that can predict weather patterns.

I often questioned it myself — but there is a reason for it. **Temperature patterns are one of the easiest time series to forecast.**

When a time series is decomposed — or broken into its individual elements — a series consists of the following components:

- **Trend:** The general direction of the time series over a significant period of time
- **Seasonality:** Patterns that frequently repeat themselves in a time series
- **Random:** Random fluctuations in…
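A hedged sketch of this decomposition on synthetic data (the slope, cycle, and noise levels are invented for illustration): build a series from known trend, seasonal, and random components, then recover the trend with a centred 12-month moving average:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(120)                             # ten years of monthly data

trend = 0.05 * t                               # slow upward drift
seasonality = 3 * np.sin(2 * np.pi * t / 12)   # annual cycle
noise = rng.normal(0, 0.5, t.size)             # random component

series = trend + seasonality + noise

# A 12-month moving average spans exactly one seasonal cycle,
# so it averages the seasonality away and leaves an estimate
# of the underlying trend
kernel = np.ones(12) / 12
trend_estimate = np.convolve(series, kernel, mode="valid")
```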

Data Science Consultant with expertise in economics and time series analysis | michael-grogan.com