Bike Sharing Analysis

Table of contents

  1. Introduction
  2. Import libraries, load and transform data
  3. A look into the data
  4. Prediction model and interpretation
  5. Prediction function for daily usage
  6. Results

1. Introduction

Bike sharing companies are on the rise since years and are very popular in the metropolitan areas around the world. Reliability, availability and user-friendly processes are important aspects of the business. To provide constant quality and availabilty to the customer it is crucial to understand the significant factors of demand. Since every bike needs maintenance from time to time you better know when you can spare some of your capacity and when it is highly demanded.

The following notebook provides:

  • a general analysis of the Bike Sharing Dataset from this website: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
  • a prediction model for the utilization of bikes per hour with insights about the factors and main effects
  • an operative function to predict the utilization for future days.

With the help of the tool provided, managers are able to plan their maintenance phases and investments more efficiently.

The problem with regression models which estimate mean values is that approximately half of all cases you will be above the estimates. To provide a high level of availability and reliability for the customer it is benificial to use upper bounds of estimated utilization hence the chosen regression method is a quantile regression model. This product allows the company to prepare for upcoming demands with respect to a desired level of reliability (e.g. 90%).

It will turn out that the most significant effect comes from customers using bikes in the morning and afterwork hours on workdays. It is recommended to plan with respect to these critical moments to obtain a high level of customer satisfaction. Ending up with no bike in the morning hours can put customers into difficult situations with a high risk of bad reputation for the company - a good ground for competitors.

2. Import libraries, load and transform data

In [1]:
import numpy as np
import pandas as pd
import math as ma
import matplotlib.pyplot as plt
% matplotlib inline
from matplotlib import mlab
import datetime

The data and attribute info from the website

Attribute Information:

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

  • instant: record index
  • dteday : date
  • season : season (1:springer, 2:summer, 3:fall, 4:winter)
  • yr : year (0: 2011, 1:2012)
  • mnth : month ( 1 to 12)
  • hr : hour (0 to 23)
  • holiday : weather day is holiday or not (extracted from [Web Link])
  • weekday : day of the week
  • workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
  • weathersit :
    1. Clear, Few clouds, Partly cloudy, Partly cloudy
    2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4. Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
  • atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
  • hum: Normalized humidity. The values are divided to 100 (max)
  • windspeed: Normalized wind speed. The values are divided to 67 (max)
  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including both casual and registered
In [ ]:
 
In [2]:
day = pd.read_csv("day.csv", sep = ',')
hour = pd.read_csv("hour.csv", sep = ',')
day['dteday'] = pd.to_datetime(day['dteday'])
hour['dteday'] = pd.to_datetime(hour['dteday'])
hour['timestamp'] = pd.to_datetime(hour['dteday']) + hour['hr'].astype('timedelta64[h]')
hour['time'] = hour['dteday'] - hour['dteday'].min() # prepare timestamp and count days for prediction
hour['days'] = hour['time'].dt.days 

3. A look into the data

First thoughts

The dataset contains no missing values, appears to be very clean, has standardized measures and no obvious outliers. I had no doubt that it is a ready to use data set.

Let us have a look at the time series of the utilization ('cnt' variable) per day and per hour.

In [3]:
count_by_day = day[['dteday','cnt']].set_index('dteday')
plt.figure()
count_by_day.plot(figsize=(15,5))
plt.legend(loc='best')
Out[3]:
<matplotlib.legend.Legend at 0x1130d7358>
<matplotlib.figure.Figure at 0x10a7cd7f0>
In [4]:
count_by_hour = hour[['timestamp','cnt']].set_index('timestamp')
plt.figure()
count_by_hour.plot(figsize=(18,10))
plt.legend(loc='best')
Out[4]:
<matplotlib.legend.Legend at 0x11bf871d0>
<matplotlib.figure.Figure at 0x1222352e8>

Thoughts on the time series

Both resolutions of the utilization time series show similar behaviour. There is periodicity over the year. In summer time more bikes are used than in winter time. It is an increasing and then decreasing progress. Very important here is that the level of Jan 2011 is almost the same as in Dec 2011, Jan 2012 and Dec 2012. The plateau during summer 2012 is higher than in 2011. I conclude that there might have been an investment in beginning of 2012 and the capacity has been scaled up by a certain factor. I would rather not speak of a significant increasing trend over time, and the linear models I tried did not show a high regression coefficient for the time aspect. So let us further assume that changes of the system like investments explain the difference between time periods. Since there is no information in the data about the total amount of bikes available we can only speculate about this. The absolute maximum values per year could give us some estimate which supports the theory. All changes of the system should be taken into account when forecasting and should be tracked by (categorical) indicators in the data to estimate a proper mean for the new time period.

The top 20 days

  1. All top 20 days are from year 2012
  2. Top days can occur in every season but half of the top 20 days are from autumn.
  3. No holiday is under the top 20 days.
  4. Top days can be working days and not.
  5. Top 20 days are never with bad weather conditions.
In [5]:
day.sort_values(by = ['cnt'], ascending = False).head(20)
Out[5]:
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
623 624 2012-09-15 3 1 9 0 6 0 1 0.608333 0.585867 0.501667 0.247521 3160 5554 8714
637 638 2012-09-29 4 1 9 0 6 0 1 0.542500 0.530296 0.542917 0.227604 2589 5966 8555
630 631 2012-09-22 3 1 9 0 6 0 1 0.650000 0.610488 0.646667 0.283583 2512 5883 8395
447 448 2012-03-23 2 1 3 0 5 1 2 0.601667 0.570067 0.694167 0.116300 2469 5893 8362
504 505 2012-05-19 2 1 5 0 6 0 1 0.600000 0.566908 0.456250 0.083975 3410 4884 8294
617 618 2012-09-09 3 1 9 0 0 0 1 0.610000 0.578925 0.547500 0.224496 2570 5657 8227
571 572 2012-07-25 3 1 7 0 3 1 1 0.724167 0.654054 0.450000 0.164800 1383 6790 8173
629 630 2012-09-21 3 1 9 0 5 1 1 0.599167 0.571971 0.668750 0.154229 1250 6917 8167
643 644 2012-10-05 4 1 10 0 5 1 1 0.615000 0.580187 0.627500 0.106350 1516 6640 8156
518 519 2012-06-02 2 1 6 0 6 0 1 0.583333 0.566288 0.549167 0.186562 2795 5325 8120
658 659 2012-10-20 4 1 10 0 6 0 1 0.484167 0.472842 0.572917 0.117537 2806 5284 8090
622 623 2012-09-14 3 1 9 0 5 1 1 0.633333 0.594083 0.672500 0.103863 1379 6630 8009
644 645 2012-10-06 4 1 10 0 6 0 1 0.554167 0.538521 0.664167 0.268025 3031 4934 7965
631 632 2012-09-23 4 1 9 0 0 0 1 0.529167 0.518933 0.467083 0.223258 2454 5453 7907
620 621 2012-09-12 3 1 9 0 3 1 1 0.599167 0.570075 0.577083 0.131846 1050 6820 7870
595 596 2012-08-18 3 1 8 0 6 0 1 0.678333 0.618071 0.603333 0.177867 2827 5038 7865
665 666 2012-10-27 4 1 10 0 6 0 2 0.530000 0.515133 0.720000 0.235692 2643 5209 7852
441 442 2012-03-17 1 1 3 0 6 0 2 0.514167 0.505046 0.755833 0.110704 3155 4681 7836
621 622 2012-09-13 3 1 9 0 4 1 1 0.612500 0.579558 0.637083 0.082721 1054 6750 7804
619 620 2012-09-11 3 1 9 0 2 1 1 0.577500 0.554292 0.520000 0.092054 1070 6697 7767

Lets consider the distribution of utilization (per day and per hour) over the years and some categorical variables.

In [6]:
fig, axes = plt.subplots(2,3, figsize=(15,10)) # create figure and axes


day.boxplot(column='cnt',by=['yr','season'], ax=axes.flatten()[0] )

day.boxplot(column='cnt',by=['yr','mnth'], ax=axes.flatten()[1])

day.boxplot(column='cnt',by=['yr','weekday'], ax=axes.flatten()[2])

day.boxplot(column='cnt',by=['yr','workingday'], ax=axes.flatten()[3])

day.boxplot(column='cnt',by=['yr','holiday'], ax=axes.flatten()[4])

day.boxplot(column='cnt',by=['yr','weathersit'], ax=axes.flatten()[5])

#fig.delaxes(axes[1,1]) # remove empty subplot
plt.tight_layout() 

plt.show()

thoughts

Boxplots from left to right row-wise:

  1. spring is the season with the lowest daily counts for each year and summer is the highest. The relations between the seasons is alike for both years.

  2. The month seems to give a better information than the season. (This graph looks like there is a positive trend over time, but the linear models did not confirm it)

  3. The days of the week could play a role, especially in year 2012. But it is not strong.
  4. Workingday information is weak with respect of daily total utilization.
  5. Holidays tend to have a slightly lower daily utilization compared to non-holidays.
  6. The worse the weather situation the lower the daily utilization.

Next step: Have a look at the data per hour and focus on the year 2012 only since the overall behaviour is very similar.

Top 20 hours

  1. All of them are form 2012
  2. Season is most likely autumn or early winter
  3. Top 20 hours are always on working days with good weather conditions during early after work.
In [7]:
hour.sort_values(by = ['cnt'], ascending = False).head(20)
Out[7]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt timestamp time days
14773 14774 2012-09-12 3 1 9 18 0 3 1 1 0.66 0.6212 0.44 0.2537 91 886 977 2012-09-12 18:00:00 620 days 620
14964 14965 2012-09-20 3 1 9 17 0 4 1 1 0.64 0.6212 0.50 0.2239 91 885 976 2012-09-20 17:00:00 628 days 628
14748 14749 2012-09-11 3 1 9 17 0 2 1 1 0.70 0.6364 0.28 0.0000 168 802 970 2012-09-11 17:00:00 619 days 619
14725 14726 2012-09-10 3 1 9 18 0 1 1 1 0.62 0.6212 0.35 0.2985 111 857 968 2012-09-10 18:00:00 618 days 618
15084 15085 2012-09-25 4 1 9 17 0 2 1 1 0.66 0.6212 0.39 0.2836 107 860 967 2012-09-25 17:00:00 633 days 633
15780 15781 2012-10-24 4 1 10 17 0 3 1 1 0.66 0.6212 0.47 0.0000 87 876 963 2012-10-24 17:00:00 662 days 662
10622 10623 2012-03-23 2 1 3 17 0 5 1 2 0.72 0.6515 0.42 0.1642 264 693 957 2012-03-23 17:00:00 447 days 447
15108 15109 2012-09-26 4 1 9 17 0 3 1 1 0.74 0.6667 0.48 0.2985 77 876 953 2012-09-26 17:00:00 634 days 634
15444 15445 2012-10-10 4 1 10 17 0 3 1 1 0.58 0.5455 0.43 0.2239 91 857 948 2012-10-10 17:00:00 648 days 648
15588 15589 2012-10-16 4 1 10 17 0 2 1 1 0.52 0.5000 0.39 0.1940 104 839 943 2012-10-16 17:00:00 654 days 654
14293 14294 2012-08-23 3 1 8 18 0 4 1 1 0.74 0.6667 0.48 0.2239 130 811 941 2012-08-23 18:00:00 600 days 600
15756 15757 2012-10-23 4 1 10 17 0 2 1 1 0.68 0.6364 0.41 0.0000 67 871 938 2012-10-23 17:00:00 661 days 661
14772 14773 2012-09-12 3 1 9 17 0 3 1 1 0.70 0.6364 0.41 0.2985 114 811 925 2012-09-12 17:00:00 620 days 620
15732 15733 2012-10-22 4 1 10 17 0 1 1 1 0.62 0.6212 0.33 0.1045 84 838 922 2012-10-22 17:00:00 660 days 660
15276 15277 2012-10-03 4 1 10 17 0 3 1 1 0.70 0.6515 0.65 0.0000 84 833 917 2012-10-03 17:00:00 641 days 641
13596 13597 2012-07-25 3 1 7 17 0 3 1 1 0.80 0.6970 0.31 0.0000 130 783 913 2012-07-25 17:00:00 571 days 571
15132 15133 2012-09-27 4 1 9 17 0 4 1 2 0.66 0.6212 0.69 0.2985 87 818 905 2012-09-27 17:00:00 635 days 635
15300 15301 2012-10-04 4 1 10 17 0 4 1 1 0.70 0.6515 0.51 0.2239 112 789 901 2012-10-04 17:00:00 642 days 642
15324 15325 2012-10-05 4 1 10 17 0 5 1 1 0.70 0.6364 0.42 0.1642 158 742 900 2012-10-05 17:00:00 643 days 643
14965 14966 2012-09-20 3 1 9 18 0 4 1 1 0.60 0.6212 0.56 0.2537 119 781 900 2012-09-20 18:00:00 628 days 628
In [8]:
fig, axes = plt.subplots(2,2, figsize=(15,10)) # create figure and axes


hour[hour['yr']==1].boxplot(column='cnt',by=['hr'], ax=axes.flatten()[0])

hour[hour['yr']==1].boxplot(column='cnt',by=['season','hr'], ax=axes.flatten()[1])

hour[hour['yr']==1].boxplot(column='cnt',by=['weekday','hr'], ax=axes.flatten()[2])

hour[hour['yr']==1].boxplot(column='cnt',by=['workingday','hr'], ax=axes.flatten()[3])

#hour.boxplot(column='cnt',by=['yr','holiday'], ax=axes.flatten()[4])

#hour.boxplot(column='cnt',by=['yr','weathersit'], ax=axes.flatten()[5])

#fig.delaxes(axes[1,1]) # remove empty subplot
plt.tight_layout() 

plt.show()

thoughts

Boxplots from left to right row-wise:

  1. Morning and afternoon hours have significant peaks over the day. Night and afternoon hours have an interesting ammount of outliers.

  2. The peaks remain for every season of the year. This is a characteristic profile.

  3. The weekday has some impact on the hourly profile. On Saturdays and Sundays people use bikes around noon. This also explains the outliers in the first figure. Eureka!
  4. The comparison of working days and non-working days is consistent with the former observations and supports the theory that people tend to use bikes more frequently at noon and in the late night on free days.
In [12]:
# Split utilization into regsitered and casual users

fig, axes = plt.subplots(1,2 , figsize=(15,5)) # create figure and axes

hour[hour['yr']==1].boxplot(column='registered',by=['workingday','hr'] , figsize=(7,5), ax=axes.flatten()[0])

hour[hour['yr']==1].boxplot(column='casual',by=['workingday','hr'] , figsize=(7,5), ax=axes.flatten()[1] )


#fig.delaxes(axes[1,1]) # remove empty subplot
plt.tight_layout() 

plt.show()

conclusions

On workingdays people use bikes to get to and from work where on weekends (and holidays) people use them more during noon and late night hours. Other factors like weather and time of the year affect the baseline of the utilization since a proportion of people might avoid biking in winter and bad weather in general.

Now let us have a look at the remeining variables and their pairwise relation

In [9]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(day[['cnt','temp','atemp','windspeed','hum','mnth','season']],
               alpha=0.2, figsize=(15, 15), diagonal='kde')
Out[9]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x129f76ac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cacf978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cb50c18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cba5710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cd1f128>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cbb3518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cdca860>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12cb510f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cbdd2e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cbbcb00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cf20208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cf5b470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cf2b710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12cfd0630>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12d010748>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d047940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d08d278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d0bcac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d0fb828>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d134908>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d184320>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12d19c0b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d308978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d357390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d391f60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d3e13c8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d618b70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d660a90>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12d671208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d6ef3c8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d83a400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d87a2e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d8c3be0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d8ff470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d94dbe0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12d98a2b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d9d24e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12d9d9c88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12da610b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12daab400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12dae0f28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12dc2bf60>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12dc70748>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12dcbac50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12dcfb588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12dd447b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12dd941d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12ddd30b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12df1e400>]], dtype=object)