# Sampling Methods for Data Science

17th September 2018

We are in the era of big data, where we can collect, store, and process huge amounts of data like never before. With so much data at our fingertips, the question arises: do we still need sampling? It's a fair question, but we will not always have complete data for an entire population, in which case big data is itself a sample. Sampling is as relevant now as it has ever been.

In this blog post I'll detail some of the most common sampling methods along with implementations in both R and Python. I have also provided two Jupyter notebooks to download containing all the code examples below, one for R and one for Python.

The goal of sampling is to fairly represent the target population. For example, a company may have collected a lot of data on its customers. Depending on the length of the collection period and other factors, the data may not fairly capture all of the customers, and sampling can address those imbalances. The data will also be limited to the company's current area of business, both geographically and commercially. Say the company is thinking about expanding into another region, or introducing new products or services, and would like to generalise beyond its current data. This is where sampling helps: it provides methods to select representative elements of the population in order to make predictions. So if the company would like to expand into another geographic region, it could sample its current customer data, making sure to select demographics representative of the new region, then run a predictive model to see how successful that expansion would be. In fact, it could take multiple samples and run a number of predictive models to compare and contrast.

That said, sampling will not be appropriate for every data set, and it is up to the data scientist to determine when to use it: for example, when the data does not properly represent the target population, or when there is a need to split the data in some way.

Sampling can be divided into two areas: probability and non-probability sampling. Non-probability sampling covers samples collected with no structure in mind, such as convenience sampling, where samples are taken because they are easily available, or snowball sampling, where one respondent refers the next, and so on. These methods are not properly randomised and would not accurately represent the population, so I will not be covering them in this post. Unlike the non-probability methods, probability-based sampling seeks to fairly represent the population by giving each member a known, non-zero chance of being selected.

## Simple Random Sampling

Simple Random Sampling is the most basic of all the probability methods. Each member of the sample is selected at random, independently of the other members of the population, and usually without replacement. This means that once chosen, a member is removed from the population and cannot be chosen again. It is useful when there is little advance information about the population, so more advanced methods such as stratified or cluster sampling cannot be used, although it does require a population size of at least a few hundred to be effective.

To take a simple random sample of five from a population of a thousand without replacement in R, you could use:

```
randomSample <- sample(1:1000, 5, replace=FALSE)
print(randomSample)
```

In Python, using the built-in random module, the equivalent would be:

```
import random
random_sample = random.sample(range(1, 1001), 5)
print(random_sample)
```
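If you are already working with NumPy, the same draw can be sketched with its Generator API (this assumes NumPy 1.17 or later, where `default_rng` was introduced):

```python
import numpy as np

# a Generator instance; pass a seed for reproducibility
rng = np.random.default_rng()
# draw five values from 1..1000 without replacement
random_sample = rng.choice(np.arange(1, 1001), size=5, replace=False)
print(random_sample)
```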

## Systematic Sampling

Systematic Sampling is where the population is numbered in a list, a random starting point is chosen, and the population is then sampled at a fixed interval until the sample is full. The interval is called the sampling interval and is calculated by dividing the population size by the sample size. For example, if we had a population of one thousand and wanted a sample size of ten, then the sampling interval would be one hundred, and we would progress through the list in steps of one hundred until the end, where we would circle around to the top and carry on until the sample was full.

In R, I have coded an example using the dplyr package. The last two lines could have been done in one line of code, but I’ve separated them for ease of reading.

```
library(dplyr)
# set sample size
sampleSize <- 10
# generate numbered data
data <- tibble(
  'number' = 1:100,
  'id' = paste('Individual', 1:100)
)
# set the sampling interval
samplingInterval <- round(nrow(data) / sampleSize)
# set a random starting point between 1 and the sampling interval
# so we get the full sample without circling around
startingPoint <- sample(1:samplingInterval, 1)
# generate a sequence from the starting point by the sampling interval
sampleList <- seq(startingPoint, nrow(data), samplingInterval)
# extract the data
systematicSample <- data[sampleList, ]
# print the data
print(systematicSample)
```

In Python I have used the pandas and NumPy libraries along with the built-in random module.

```
import pandas as pd
import numpy as np
import random
# set sample size
sample_size = 10
# generate numbered data
data = pd.DataFrame({
    'number': [i for i in range(1, 101)],
    'id': ['Individual ' + str(x) for x in range(1, 101)]
})
# set the sampling interval
sampling_interval = round(len(data) / sample_size)
# set a random starting point between 1 and the sampling interval
# so we get the full sample without circling around
starting_point = random.randint(1, sampling_interval)
# generate a sequence from the starting point by the sampling interval
sample_list = np.arange(starting_point, len(data) + 1, sampling_interval)
# extract the data
systematic_sample = data[data['number'].isin(sample_list)]
# print the data
print(systematic_sample)
```
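pandas can also express the systematic draw positionally: an `iloc` slice with a step takes every k-th row from a random start, which avoids matching on the `number` column entirely. A short sketch using the same toy data:

```python
import random
import pandas as pd

sample_size = 10
# the same numbered data as above
data = pd.DataFrame({
    'number': [i for i in range(1, 101)],
    'id': ['Individual ' + str(x) for x in range(1, 101)]
})
sampling_interval = round(len(data) / sample_size)
# random start inside the first interval, then every k-th row by position
starting_point = random.randint(0, sampling_interval - 1)
systematic_sample = data.iloc[starting_point::sampling_interval]
print(systematic_sample)
```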

## Stratified Random Sampling

Stratified Random Sampling is when the population is divided into groups (strata) based on a certain characteristic, and random samples are then taken from those groups so as to represent the correct proportion of that characteristic within the population. Say we know that 26% of the population owns a pet cat, 31% owns a pet dog, and 43% owns neither; then 26% of our sample would be randomly selected from cat owners, 31% from dog owners, and 43% from people with neither.

In R, I have coded an example using the dplyr, plyr, and tibble packages. dplyr and tibble are part of the tidyverse, but I have imported them separately to show precisely what I am using.

```
library(dplyr)
library(plyr)
library(tibble)
# set sample size
sampleSize <- 10
# generate numbered data
data <- tibble(
  'number' = 1:100,
  'OwnsCatOrDog' = c(
    rep('cat', 26),
    rep('dog', 31),
    rep('neither', 43)
  )
)
# get list of strata
strata <- data %>%
  distinct(OwnsCatOrDog) %>%
  pull()
# function to take a stratum and randomly sample the correct proportion
stratify <- function(stratum, data, sampleSize) {
  # get all members of the subgroup
  subgroup <- data %>%
    filter(OwnsCatOrDog == stratum)
  # randomly select the correct proportion
  subgroup %>%
    sample_n(round(nrow(subgroup) / nrow(data) * sampleSize))
}
# use lapply to call the stratify function,
# passing the strata, data, and sample size
sampleList <- lapply(strata, stratify, data=data, sampleSize=sampleSize)
# combine the sample list into a single tibble
stratifiedSample <- as_tibble(ldply(sampleList))
# print the data
print(stratifiedSample)
```

As the *round()* function determines each subgroup's share, and in most cases the strata proportions will not be integers, the returned sample size can differ from the specified one by small amounts. The *floor()* or *ceiling()* functions could be used instead to bias towards a smaller or larger sample respectively.
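If hitting the specified sample size exactly matters, one common fix is largest-remainder allocation: floor each stratum's share, then hand the leftover slots to the strata with the biggest fractional parts. A minimal sketch (the `allocate` helper is my own, and the proportions are the cat/dog/neither shares from the example above):

```python
import math

def allocate(proportions, sample_size):
    # each stratum's exact (fractional) share of the sample
    shares = {k: p * sample_size for k, p in proportions.items()}
    # start from the floor of each share
    counts = {k: math.floor(s) for k, s in shares.items()}
    # hand remaining slots to the largest fractional remainders
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(shares, key=lambda k: shares[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

print(allocate({'cat': 0.26, 'dog': 0.31, 'neither': 0.43}, 10))
# {'cat': 3, 'dog': 3, 'neither': 4}
```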

In Python I have used the pandas library along with the built-in itertools module.

```
import pandas as pd
import itertools as it
# set sample size
sample_size = 10
# generate numbered data
data = pd.DataFrame({
    'number': [i for i in range(1, 101)],
    'ownsCatOrDog':
        ['cat'] * 26 +
        ['dog'] * 31 +
        ['neither'] * 43
})
# get list of strata
strata = data.ownsCatOrDog.unique()
# function to take a stratum and randomly sample the correct proportion
def stratify(stratum, data, sample_size):
    # get all members of the subgroup
    subgroup = data[data['ownsCatOrDog'] == stratum]
    # randomly select the correct proportion
    subgroup_sample = subgroup.sample(n=round(len(subgroup) / len(data) * sample_size))
    return subgroup_sample
# use map to call the stratify function, passing the strata, data, and sample size
sample_list = map(stratify, strata, it.repeat(data), it.repeat(sample_size))
# combine the sample list into one data frame
stratified_sample = pd.concat(list(sample_list))
# print the data
print(stratified_sample)
```
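pandas can also express stratification more compactly with `groupby`, which applies the proportional draw to each stratum in one expression (newer pandas versions may emit a deprecation warning about the grouping column being included in `apply`, but the result is the same):

```python
import pandas as pd

sample_size = 10
data = pd.DataFrame({
    'number': [i for i in range(1, 101)],
    'ownsCatOrDog': ['cat'] * 26 + ['dog'] * 31 + ['neither'] * 43
})
# sample each stratum in proportion to its share of the population
stratified_sample = (
    data.groupby('ownsCatOrDog', group_keys=False)
        .apply(lambda g: g.sample(n=round(len(g) / len(data) * sample_size)))
)
print(stratified_sample)
```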

## Cluster Sampling

Cluster sampling is where a population is divided into groups (clusters) that each represent a microcosm of the population. A sample of the clusters is then taken, with further sampling from within those clusters. The clusters should be similar to one another, while the members within each cluster should be as heterogeneous as possible. This may sound similar to stratified sampling, but the difference is that in cluster sampling the entire cluster is the sampling unit: instead of taking a sample of individuals from within each stratum, whole clusters are selected.

There are two types of cluster sampling: **one-stage**, where all members of each sampled cluster are included, and **two-stage**, where another sampling method is used to select individuals from within the sampled clusters.

For example, if we wanted a sample of people living in New York City, we could cluster the city by its five boroughs and take our sample of clusters from those boroughs before applying either the one-stage or the two-stage method.

In R I have used the dplyr library again along with figures from a 2017 estimate of the population of the New York City boroughs.

```
library(dplyr)
# set sample size
sampleSize <- 3
# generate numbered data
data <- tibble(
  'number' = 1:8622698,
  'borough' = c(
    rep('Manhattan', 1664727),
    rep('Brooklyn', 2648771),
    rep('Queens', 2358582),
    rep('The Bronx', 1471160),
    rep('Staten Island', 479458)
  )
)
# get a sample of clusters as a list
sampleClusters <- data %>%
  distinct(borough) %>%
  sample_n(sampleSize) %>%
  pull()
# extract the sampled clusters from the population
clusterSample <- data %>% filter(borough %in% sampleClusters)
# print the first 20 rows of data
head(clusterSample, 20)
```

In Python I have used pandas again along with the built-in random module.

```
import pandas as pd
import random
# set sample size
sample_size = 3
# generate numbered data
data = pd.DataFrame({
    'number': [i for i in range(1, 8622699)],
    'borough':
        ['Manhattan'] * 1664727 +
        ['Brooklyn'] * 2648771 +
        ['Queens'] * 2358582 +
        ['The Bronx'] * 1471160 +
        ['Staten Island'] * 479458
})
# get the available clusters as a list
clusters = list(data.borough.unique())
# get a sample of clusters
sample_clusters = random.sample(clusters, sample_size)
# extract the sampled clusters from the population
cluster_sample = data[data['borough'].isin(sample_clusters)]
# print the first 20 rows of data
cluster_sample.head(20)
```
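The two-stage variant mentioned above then takes a further simple random sample inside each chosen cluster. A sketch on a small made-up population (the three clusters and the second-stage size of five are arbitrary, chosen to keep the example quick to run):

```python
import pandas as pd
import random

# small illustrative population: three clusters of unequal size
data = pd.DataFrame({
    'number': [i for i in range(1, 61)],
    'borough': ['A'] * 10 + ['B'] * 20 + ['C'] * 30
})
# stage one: randomly choose two of the clusters
sample_clusters = random.sample(list(data.borough.unique()), 2)
# stage two: simple random sample of five individuals from each chosen cluster
two_stage_sample = pd.concat(
    data[data['borough'] == c].sample(n=5) for c in sample_clusters
)
print(two_stage_sample)
```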

## Summary

Sampling techniques seem to be overlooked in the field of data science, and I don't see them covered in many articles on essential data science skills. Big data doesn't always mean good data, or complete data; just as cleaning data is important for improving its quality, so is sampling in the right circumstances. The important point to remember, as I've mentioned throughout, is that we want to represent the target population as accurately as possible, and sampling is another tool that can help us towards this goal.