Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice. The findings and interpretations in this article are solely those of the author and are not endorsed by or affiliated with any third party mentioned in this article. This article is not intended to promote any particular company or product.
In February of 2020, Forrester predicted that as a result of COVID-19, demand for hardware, including computer and communications equipment, would be quite weak due to…
Generating random numbers from a particular distribution is a process that can be automated to a large extent.
For instance, if one wants to generate 100 random numbers that belong to a normal distribution in R, it is as simple as executing:
rnorm(100)
However, how does this process actually work “under the hood”? How can an algorithm know whether a random number belongs to a particular distribution or not?
The answer is through rejection sampling.
Rejection sampling is a means of generating random numbers that belong to a particular distribution.
A Cartesian graph consists of x and y-axes across a defined…
The primary purpose of Bayesian analysis is to model data given uncertainty.
Since one cannot access all the data about a population to determine its precise distribution, assumptions about that distribution often have to be made.
For instance, I might make an assumption regarding the mean height of the population in a particular country. This is a prior distribution: a distribution founded on beliefs held before looking at data that could confirm or contradict them.
Upon analysing a new set of data (which yields a likelihood function), the prior beliefs and the likelihood function can then be combined to form the…
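As a minimal sketch of that combination (with hypothetical numbers of my own, and a normal prior paired with a normal likelihood of known variance so that the update has a closed form):

import numpy as np

# Prior belief about mean height in cm (hypothetical values)
prior_mean, prior_var = 170.0, 10.0 ** 2

# Newly observed heights (the likelihood), with an assumed, known noise variance
data = np.array([168.0, 174.0, 171.0, 169.0, 173.0])
noise_var = 7.0 ** 2

# Conjugate normal-normal update: combine prior and likelihood into the posterior
n = len(data)
post_var = 1.0 / (1.0 / prior_var + n / noise_var)
post_mean = post_var * (prior_mean / prior_var + data.sum() / noise_var)

print(f"Posterior mean: {post_mean:.1f} cm, posterior sd: {post_var ** 0.5:.2f} cm")

The posterior mean lands between the prior mean and the sample mean, weighted by how informative each source of belief is.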
Rejection sampling is a means of generating random numbers that belong to a particular distribution.
For instance, let’s say that one wishes to generate 1,000 random numbers that follow a normal distribution. Doing this in Python with numpy takes a single line:
np.random.randn(1000)
However, how exactly does this process work? When random numbers are generated in Python, how can the algorithm know whether a given number belongs to a particular distribution or not? This is where rejection sampling comes in.
There is a reason I provided an image of a darts board at the beginning…
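The darts analogy maps directly onto the algorithm: throw points at random under a bounding envelope, and keep only those that land under the target density curve. Below is a minimal sketch of my own (not the article’s implementation), using a uniform proposal on [-4, 4] to target a standard normal density; the mass outside that interval is negligible but technically truncated:

import numpy as np

def normal_pdf(x):
    # Standard normal density: the distribution we wish to sample from
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def rejection_sample(n, lo=-4.0, hi=4.0, seed=42):
    rng = np.random.default_rng(seed)
    peak = normal_pdf(0.0)  # height of the envelope: the density's maximum
    samples = []
    while len(samples) < n:
        x = rng.uniform(lo, hi)     # candidate "dart" position along the x-axis
        u = rng.uniform(0.0, peak)  # random height under the envelope
        if u < normal_pdf(x):       # accept if the dart lands under the curve
            samples.append(x)
    return np.array(samples)

draws = rejection_sample(1000)
print(draws.mean(), draws.std())  # roughly 0 and 1

Candidates falling above the curve are rejected, which is exactly why the method carries that name.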
A power law distribution (such as a Pareto distribution) describes the 80/20 rule that governs many phenomena around us.
For instance:
These are just a few examples. While many believe that most datasets tend to follow a normal distribution, power law distributions are a lot more common than we realise. …
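To make that concrete, here is a small sketch of my own (the shape parameter 1.16 is the value often associated with an exact 80/20 split) showing how much of the total is concentrated in the top fifth of draws:

import numpy as np

rng = np.random.default_rng(0)

# numpy's pareto draws from the Lomax form; adding 1 gives a classical
# Pareto distribution with minimum value 1 and shape parameter alpha
alpha = 1.16
samples = np.sort(rng.pareto(alpha, 1_000_000) + 1)

# Share of the overall total held by the top 20% of observations
top_20_share = samples[int(0.8 * len(samples)):].sum() / samples.sum()
print(f"Top 20% of observations hold {top_20_share:.0%} of the total")

On most runs this lands near 80%, a degree of concentration a normal distribution does not produce.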
I often like to play chess and minesweeper in my spare time (yes, don’t laugh).
Of these two games, I have always found minesweeper more difficult to understand, and the rules of play have always seemed very opaque.
However, the latter game much more closely resembles how situations often unfold in the real world. Here is why that is relevant to data science.
Compare that to chess, where regardless of one’s playing ability, all players have perfect information at all times.
One can always see every piece on the board, and neither opponent possesses any informational advantage…
Tools such as Python and R are most often used to conduct in-depth time series analysis.
However, knowledge of how to work with time series data using SQL is essential, particularly when working with very large datasets or data that is constantly being updated.
Here are some useful commands that can be invoked in SQL to better work with time series data within the data table itself.
In this example, we are going to work with weather data collected across a range of different times and locations.
The data types in the PostgreSQL database table are as below:
…
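To illustrate the general pattern (the article works against PostgreSQL; the sketch below uses Python’s built-in sqlite3 so that it is self-contained, and the table and column names are hypothetical), readings can be aggregated to a daily average per location directly in SQL:

import sqlite3

# Hypothetical weather table: one temperature reading per location and timestamp
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE weather (
        location    TEXT,
        recorded_at TEXT,  -- ISO-8601 timestamp
        temp_c      REAL
    )
""")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("dublin", "2020-01-01 09:00:00", 4.5),
        ("dublin", "2020-01-01 15:00:00", 7.0),
        ("dublin", "2020-01-02 09:00:00", 3.0),
        ("cork",   "2020-01-01 12:00:00", 6.5),
    ],
)

# Aggregate raw readings to a daily average per location
rows = conn.execute("""
    SELECT location,
           strftime('%Y-%m-%d', recorded_at) AS day,
           AVG(temp_c) AS avg_temp
    FROM weather
    GROUP BY location, day
    ORDER BY location, day
""").fetchall()

for row in rows:
    print(row)

In PostgreSQL itself one would typically reach for date_trunc rather than strftime for this kind of time bucketing.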
For someone coming from an economics background, it might seem quite strange that I would spend time building models to predict weather patterns.
I often questioned it myself, but there is a reason for it: temperature patterns are among the easiest time series to forecast.
When a time series is decomposed, or broken into its individual elements, it consists of the following components: a trend, a seasonal component, and a residual (or noise) component.
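As a short illustration of pulling those components apart (using a synthetic monthly series of my own rather than the article’s weather data), statsmodels provides seasonal_decompose:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly "temperature" series: trend + yearly seasonality + noise
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
series = pd.Series(
    0.02 * np.arange(96)                           # slow upward trend
    + 10 * np.sin(2 * np.pi * np.arange(96) / 12)  # yearly seasonal cycle
    + rng.normal(0, 1, 96),                        # residual noise
    index=idx,
)

# Split the series into its trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())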
Average daily rates (henceforth referred to as ADR) represent the average rate per day paid by a staying customer at a hotel.
This is an important metric for a hotel, as it represents the overall profitability of each customer.
In this example, average daily rates for each customer are averaged on a weekly basis and then forecasted using an ARIMA model.
The below analysis is based on data from Antonio, Almeida and Nunes (2019): Hotel booking demand datasets.
In this particular dataset, the year and week number for each customer (along with each customer’s recorded ADR value) is provided separately.
…
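As a sketch of that workflow (with synthetic weekly ADR values of my own and an arbitrary (1, 1, 1) order, not the model fitted in the article):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic weekly ADR series standing in for the hotel booking data
rng = np.random.default_rng(7)
idx = pd.date_range("2015-07-05", periods=104, freq="W")
weekly_adr = pd.Series(90 + np.cumsum(rng.normal(0, 2, 104)), index=idx)

# Hold out the final 8 weeks, fit the model on the rest, and forecast
train, test = weekly_adr[:-8], weekly_adr[-8:]
model = ARIMA(train, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=8)

rmse = float(np.sqrt(((forecast.values - test.values) ** 2).mean()))
print(f"RMSE over the held-out 8 weeks: {rmse:.2f}")

In practice the (p, d, q) order would be chosen from the data, e.g. via ACF/PACF plots or an automated search, rather than fixed up front.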
In this example, XGBRegressor is used to predict kilowatt consumption patterns for the Dublin City Council Civic Offices, Ireland. The dataset in question is available from data.gov.ie.
Have you used XGBoost (Extreme Gradient Boosting) for classification tasks before? If so, you will be familiar with the workings of this model.
Essentially, a gradient boosting model works by adding predictors to an ensemble sequentially, with each new predictor fit to the residual errors made by the previous one. …
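A minimal sketch of that in regression form (synthetic hourly kilowatt readings of my own, with illustrative hyperparameters, rather than the Civic Offices dataset itself):

import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic hourly kilowatt readings driven by hour-of-day and day-of-week
rng = np.random.default_rng(3)
hours = np.arange(24 * 365)
X = np.column_stack([hours % 24, (hours // 24) % 7])  # features: hour, weekday
y = (50 + 20 * np.sin(2 * np.pi * X[:, 0] / 24)       # daily usage cycle
     + 5 * (X[:, 1] < 5)                              # weekday uplift
     + rng.normal(0, 3, len(hours)))                  # noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Trees are added sequentially, each one fit to the ensemble's current residuals
model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"Test RMSE: {rmse:.2f} kW")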
Data Science Consultant — Expertise in time series analysis, statistics, Bayesian modeling, and machine learning with TensorFlow | michael-grogan.com