No doubt that crypto currencies with all the promises they bring, both financially and otherwise, are only here to stay. As a data scientist interested in data and numbers, I thought it would be nice to take a look at some crypto currencies with my favorite tool, R.

R Libraries

Below is a list of R libraries we will be using to help us with our analysis. Not all of them are necessary but they all will make our life easier.

library(PoloniexR)
library(data.table)
library(lubridate)
library(Quandl)
library(plyr)
library(stringr)
library(ggplot2)
library(plotly)
library(janitor)
library(quantmod)
library(pryr)
library(corrplot)
library(PerformanceAnalytics)
library(tidyr)
library(MLmetrics)
library(readr)
library(knitr)

Getting the Data

1. PoloniexR Package

The easiest way to get current and historical data for cyrpto currencies is by using the PoloniexR developed by Vermeir Jellen. Vermeir Jellen gives a good tutorial on how to start with his package here. The Poloniex exchange includes many coins but not all. For missiong coins on Poloniex, one can scrape the coinmarketcap page, an example is given here.

2. Quandl

Quandl is my go to place for any financial data. Their free-tier API has lots of good data one can use. Quandl offers data from multiple exchanges. Locating crypto data on Quandl is not straight forward. After spending few hours on their site I found out that most of the crypto data can be found here

First, let’s take a look at different exchange data for Bitcoin using Quandl. We will download and plot historical bitcoin data from the following exchanges Kraken, Coinbase, Bitstamp, and ITBIT

# enable your Quandl API key
my_quandl_api_key <- read_file("../../quandl_api_key.txt")
Quandl.api_key(my_quandl_api_key)

# function to download quandl data
get_quandl_data <- function(data_source = "BITFINEX"
                            , pair = 'btcusd'
                            , ...){
  
  # make sure the user supplied the correct data_source
  if(toupper(data_source) != "BITFINEX") stop("data source supplied is wrong...")
  # quandl is case sensitive, all codes have to be upper case
  pair <- toupper(pair)
  tmp <- NA
  try(tmp <- Quandl(code = toupper(paste(data_source, pair, sep = "/")), ...), silent = TRUE)
  return(tmp)
}

# get btc data from different exchanges

  exchange_data <- list()
  
  exchanges <- c('KRAKENUSD','COINBASEUSD','BITSTAMPUSD','ITBITUSD')
  
  for (i in exchanges){
    exchange_data[[i]] <- Quandl(paste0('BCHARTS/', i))
  }

We need to convert this list of BTC prices from different exchanges into a dataframe and put them all in one data frame so we can plot them.

# put them all in one dataframe to plot in ggplot2
btc_usd <- do.call("rbind", exchange_data)
btc_usd$exchange <- row.names(btc_usd)
btc_usd <- as.data.table(btc_usd)

We also need to do some minor cleaning, so let’s do that. We also need to get rid of rows of data with 0 weighted price.

# some data cleaning
btc_usd[, exchange := as.factor(str_extract(exchange, "[A-Z]+"))]
btc_usd <- clean_names(btc_usd)
btc_usd <- btc_usd[weighted_price > 0]

# set datatable key to be the date column
setkey(btc_usd, date)

Let’s take a look at the data table we just made.

head(btc_usd)
##          date open high  low close volume_btc volume_currency
## 1: 2011-09-13 5.80 6.00 5.65  5.97   58.37138        346.0974
## 2: 2011-09-14 5.58 5.72 5.52  5.53   61.14598        341.8548
## 3: 2011-09-15 5.12 5.24 5.00  5.13   80.14080        408.2590
## 4: 2011-09-16 4.82 4.87 4.80  4.85   39.91401        193.7631
## 5: 2011-09-17 4.87 4.87 4.87  4.87    0.30000          1.4610
## 6: 2011-09-18 4.87 4.92 4.81  4.92  119.81280        579.8431
##    weighted_price    exchange
## 1:       5.929231 BITSTAMPUSD
## 2:       5.590798 BITSTAMPUSD
## 3:       5.094272 BITSTAMPUSD
## 4:       4.854515 BITSTAMPUSD
## 5:       4.870000 BITSTAMPUSD
## 6:       4.839576 BITSTAMPUSD

The data includes 10 columns, the date, OCHL prices, volumes in USD and BTC, the weighted price, and the exchange. I wish I had bought me some bitcoine back in 2011!!!

Now we’ll look at the price of bitcoin and color code it by exchange.

ggplot(btc_usd, aes(x = date, y = weighted_price, col = exchange)) + geom_line() + theme_bw()

## Arbitrage

It appears the prices of btc on different exchanges are fairly consisant. But this is an artifact in the figure since we are covering several orders of magnitudes during the timeline we selected. To better see any price differenes we need to zoon in on the figure. Let’s zoom in on, say the first month of 2018, were we had the ATH for all coins. This will enable us to better see any differences in prices.

ggplot(btc_usd[date >= ymd("2018-01-01") & date <= ymd("2018-01-31")], aes(x = date, y = weighted_price, col = exchange)) + geom_line() + theme_bw()

There are obvious differences in prices between the exchanges. Differences seem to vary over time as well. Actually it will be interesting to look at the maxiumum price differences as a function of time, let’s do that

# first let's find the minimum price by date
btc_usd[, min_price := min(weighted_price), by = date]

# now we need to find the price difference between the price for each day and the minimum price for that day
# but since the price of bitcoin varies a lot for the time period under study, we need to normalize the price difference
# to do that we will just divide by the median price for each day
btc_usd[, price_diff := 100*(weighted_price - min_price)/median(weighted_price), by = (date)]

Now we have a new column which gives us the percentage of price differences for each day normalized to the median price for each day. Let’s take a look at the new table.

tail(btc_usd)
##          date    open    high     low   close volume_btc volume_currency
## 1: 2018-06-10 7498.00 7498.00 6627.70 6781.17  14392.695       100255171
## 2: 2018-06-10 7496.00 7496.00 6618.07 6762.19   7958.791        55442370
## 3: 2018-06-11 6768.20 6850.00 6629.00 6784.90   3392.389        22896077
## 4: 2018-06-11 6770.31 6839.30 6666.66 6789.98  10742.752        72568653
## 5: 2018-06-11 6781.17 6835.22 6634.86 6790.00  10684.656        72110479
## 6: 2018-06-11 6765.66 6838.94 6625.61 6793.06   5055.952        34091243
##    weighted_price    exchange min_price price_diff
## 1:       6965.698 BITSTAMPUSD  6942.736 0.32962864
## 2:       6966.180    ITBITUSD  6942.736 0.33653992
## 3:       6749.249   KRAKENUSD  6742.794 0.09565161
## 4:       6755.127 COINBASEUSD  6742.794 0.18273666
## 5:       6748.975 BITSTAMPUSD  6742.794 0.09158795
## 6:       6742.794    ITBITUSD  6742.794 0.00000000

The reason I looked at the newer dates is that prior to 2014 we only have data for one exchange, so all the price differences were 0. Let’s take a look at the price differences as a function of time.

# first we need to create a new data table with only the maxiumum prices per day
tmp <- btc_usd[, price_diff := max(price_diff), by = date]

# This will help us visualize overlapping points
MyGray <- rgb(t(col2rgb("black")), alpha=50, maxColorValue=255)
tmp[, plot(date, price_diff, pch=20, col = MyGray, xlab = "Date", ylab = "Maximum of Price Difference (%)")]

## NULL

Let’s show the plot with log scale on y axis. Let’s also discard dates with zero price differences.

tmp[price_diff > 0, plot(date, price_diff, pch=20, col = MyGray, log = "y" , xlab = "Date", ylab = "Maximum of Price Difference (%)")]

## NULL

As one can see from the figure above, the bulk of the maximum difference in bitcoin prices between the different exchanges is in the 0.5-2.0% range. It is also interesting to see that the differences in prices seem to have come down between 2014 and 2016, but they seem to go up starting in 2017. Let’s fit a gam model to see what we get.

# we'll use ggplot and fit a gam smooth line
ggplot(tmp[price_diff > 0 ], aes(x = date, y = price_diff)) + geom_point(alpha = 0.2, shape = 16, size = 3, show.legend = FALSE) + scale_y_continuous(trans='log10') + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) + theme_bw() + xlab("Date") + ylab("Maximum of Price Difference (%)")

The regression line shows a hint of an over all downtrend from 2014 to mid 2016, except for an uptrend for few months in late 2015. The trend seems to have gone up in mid to late 2017, and again we see a downword movement in price differences starting in December of 2017. This can be seen better in the box-plot figure below.

tmp[, month_year := format(as.Date(date), "%Y-%m")]
ggplot(tmp[price_diff > 0], aes(x = month_year, y = price_diff)) + geom_boxplot() + scale_y_continuous(trans='log10') + xlab("Date (Year-Month)") + ylab("Maximum of Price Difference (%)") + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

The box-plot figure above shows the variation of maximum differences in prices as a function of time. On the x-axis I grouped dates by month since anything less than a one-month period will result in congested figure.

Okay, now let’s find out which of the exchanges contribute the most to these price difference. That is, we are trying to determine which exchanges are constantly selling bitcoin higher, or lower, than the rest of the exchanges. We need to pull some numbers as below, and then we’ll make a bar plot to show the leading exchanges in each category.

# add two columns to our data table which will contain the minimum and maximum prices
tmp[, `:=`(day_max = max(weighted_price), day_min = min(weighted_price)), by = date]

# now put only the columns we care about in a new data.table
tmp2 <- tmp[price_diff > 0 , .(date, exchange, weighted_price, day_min, day_max)]

# notice how we excluded days with no price difference
# now we only want to keep the rows with the maximum and minimum daily prices
tmp2 <- tmp2[weighted_price == day_min | weighted_price == day_max]

# now we'll add a new column designating the price as being the minimum or maximum
tmp2[, max_min := ifelse(weighted_price == day_min, "min", "max")]

# clean the name of the exchange
tmp2[, exchange := gsub("USD", "", exchange)]

# now we'll add a new column containing the exchange name and the min_max column
tmp2[, max_min_exchange := paste(max_min, exchange, sep = "-")]

In the above chunk of code we created a new table which contains the maximum and minimum prices for each day. The table also contains a categorical column showing to which exchange this max/min price belong, and if the price was a maxima or a minima. Before we plot the table above, let’s have a quick look at it.

knitr::kable(head(tmp2))
date exchange weighted_price day_min day_max max_min max_min_exchange
2013-08-25 BITSTAMP 111.5484 110.3100 111.5484 max max-BITSTAMP
2013-08-25 ITBIT 110.3100 110.3100 111.5484 min min-ITBIT
2013-08-26 BITSTAMP 112.1077 105.8300 112.1077 max max-BITSTAMP
2013-08-26 ITBIT 105.8300 105.8300 112.1077 min min-ITBIT
2013-11-04 BITSTAMP 220.4782 207.4124 220.4782 max max-BITSTAMP
2013-11-04 ITBIT 207.4124 207.4124 220.4782 min min-ITBIT

The max_min_exchange column contains all the data we need, so let’s make a barplot of this variable, we’ll color the barplot by the max_min criteria shown in max_min column

# now make a barplot 
ggplot(tmp2, aes(x = max_min_exchange, fill = max_min)) + geom_bar() + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Exchanges with the highest and lowest price differences in Bitcoin") + ylab("Frequency") + xlab("Exchange") + scale_fill_discrete(name = "BTC Price Diff. Type")

This is interesting, Kraken seems to be the exchange with the most frequent highest prices for bitcoin. On the other hand, Bitstamp seems to be the one with the most frequent lowest prices among exchanges. So if you want to do arbitrage your best bit is to buy on Bitstamp and sell on Kraken.