This post follows a notebook I made on Kaggle as an exploratory approach to Bitcoin Blockchain data. The goal of this EDA is to become familiarized with blockchain data structure, and where to find information about blocks, transactions, inputs and output - the main elements of the blockchain.
The highest level are the blocks, blocks are containers of:
The second level are transactions, which include:
Since each transaction can have multiple inputs and outputs, there are two nested tables within transaction table that provide the highest detail of a transaction. With inputs and outputs tables, you can see how much bitcoin was sent in satoshis, how much was received, how much was used for fees, the origin of each input and the recipient(s) of each transaction.
Transactions can be:
Lastly, an important concept for bitcoin is the UTXO model: in this model, all the “available” bitcoin is the bitcoin that has not been spent. So if a transaction has a hash 123, and this hash is not used as an input in any other transaction, it means that the bitcoin is available to be spent. Otherwise, any bitcoin that is used as an input, is bitcoin that has been spent and can never be used again.
Back to blockchain data - altogether, there are four tables on Big Query:
To explore Blocks table, I extract the average size by month since 2009 and plot its distribution through time. To understand transaction table, I explore the first recorded transaction in the blockchain, and to understand how inputs and outputs are used, I analyze the pizza transaction 🍕.
I currently use two ways to use BigQuery - kaggle and python. There are plenty of tutorials for python and BigQuery, so here I’ll just share the chunk needed to access it through kaggle.
from google.cloud import bigquery client = bigquery.Client()
Field descriptions from the bitcoin-etl-airflow repo.
#Lets see the first 10 blocks q_blocks ='''SELECT * FROM `BigQuery-public-data.crypto_bitcoin.blocks` order by timestamp_month limit 10 ''' blocks = client.query(q_blocks).to_dataframe() blocks.to_csv('blocks_head.csv') blocks.head()
# Aggregating block size and count over months q_blocks_m ='''SELECT timestamp_month , count(*) as n_blocks , avg(size) as mean_size , avg(stripped_size) as mean_stripped_size FROM `BigQuery-public-data.crypto_bitcoin.blocks` group by 1 ''' blocks_m = client.query(q_blocks_m).to_dataframe() blocks_m.to_csv('blocks_size_month.csv') blocks_m.head()
blocks_m['month'] = blocks_m.timestamp_month.astype(str).str[0:7] blocks_m = blocks_m.sort_values(by=['month']).copy() p = sns.barplot(x = blocks_m['month'], y= blocks_m['n_blocks'], color='teal') p.set_title('Number of blocks by month', size = 20) p.set_xticklabels(p.get_xticklabels(), rotation=90, size = 9); p = sns.barplot(x = blocks_m['month'], y= blocks_m['mean_stripped_size'], color='teal') p.set_title('Average block stripped size by month', size = 20) p.set_xticklabels(p.get_xticklabels(), rotation=90, size = 9);
#Lets see the first 50 transactions trx = """ SELECT `hash` as txn_hash ,size as txn_size ,version as txn_version ,lock_time as txn_lock_time ,block_hash as txn_block_hash ,block_number as txn_block_number ,block_timestamp as txn_block_timestamp ,input_count as txn_input_count ,input_value as txn_input_value ,output_count as txn_output_count ,output_value as txn_output_value ,is_coinbase as txn_is_coinbase ,fee as txn_fee FROM `BigQuery-public-data.crypto_bitcoin.transactions` order by block_timestamp limit 50 """ query_job = client.query(trx) iterator = query_job.result(timeout=60) rows = list(iterator) # Transform the rows into a nice pandas dataframe trx = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows.keys())) trx.to_csv('transactions_head.csv') trx.head()
Without those nested tables, there’s no much “transaction” information you can get, other than the timestamp, the output value in satoshis, and a flag if it’s coinbase or not. Coinbase = True means that it is the first transaction included in a block, that is, the one is created for mining. We can use the hash and go online and explore a transaction.
# Get the hash of the very first bitcoin transaction and explore it on blockchain.com trx.txn_hash
Use this hash and check it on the blockchain explorer to confirm some values from the table. For example: is_coinbase = True since it’s the first transaction included in a block, the output value of 5000000000, in sats, was the 50 BTC reward for processing a block back in 2009. Other pieces of info such as addresses can’t be obtained from the transactions table alone, but we need to see the inputs and outputs.
Field descriptions very similar to the inputs table:
The previously mentioned transaction was the first one, and involves 50 btc that have never been spent, so it has no input (was created from mining) and no linked transaction after that (never spent). To go deeper into a more typical transaction that has inputs and outputs, we need the nested tables inputs and outputs.
A good example for is the 🍕 transaction, with hash
a1075db55d416d3ca199f55b6084e2115b9345e16c5cf302fc80e9d5fbf5d48d we can pull information from the three tables. The pizza transaction is the first known use of Bitcoin to buy delicious food. It was broadcasted to the network on May 22, 2010, after Laszlo Hanyecz paid 10,000 for two pizzas.
#Lets see the pizza transaction, and how the recipient of those 10K btc spent the coins) q_anatomy = """ SELECT `hash` as txn_hash ,block_timestamp as txn_block_tms ,input_count as txn_input_count ,input_value as txn_input_value ,i.spent_transaction_hash as nested_hash ,i.value as nested_input_value ,output_count as txn_output_count ,output_value as txn_output_value ,o.value as nested_output_value FROM `bigquery-public-data.crypto_bitcoin.transactions` JOIN UNNEST(inputs) as i JOIN UNNEST(outputs) as o WHERE `hash` in ('a1075db55d416d3ca199f55b6084e2115b9345e16c5cf302fc80e9d5fbf5d48d', 'cca7507897abc89628f450e8b1e0c6fca4ec3f7b34cccf55f3f531c659ff4d79') ORDER BY block_timestamp """ query_job = client.query(q_anatomy) iterator = query_job.result(timeout=60) rows = list(iterator) # Transform the rows into a nice pandas dataframe anatomy = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows.keys())) anatomy
Here we get 133 rows - for just two transactions! Let’s analyze some information we get from this table:
There are 131 rows that have a txn_hash starting with a1075… these are the inputs of the pizza purchase. This means that the pizza eater pulled coins from 131 different sources, adding up to 10,000 BTC, which were sent to only one address.
The input count is 131 (gathering from 131 sources) and the output count is just 1 (sent to just one address).
The transaction input value is 1000099000000, because it’s including the fee.
The transaction output is 1000000000000, because the recipient doesn’t receive the fee.
There are two rows that have a txn_hash starting with cca750… this transaction is what the recipient of the 10,000 did: he split the 10K in two addresses, sending 577700000000 to one and 422300000000 to the other.
We can see how each transaction is linked to the other, in these two rows the nested_hash shows the input of the transaction, which is the hash a1075… - the pizza transaction.
Mapping the elements of the inputs and outputs table with concrete transactions was my only method to understand what each table contains. After mapping the most essential fields with its meaning, it could be possible to actually analyze Bitcoin Blockchain data – but these are baby steps, as bitcoin blockchain data is hard to load and manipulate.
Be the first and only to leave a comment ever - via github on this ticket. Comments should appear on this page instantly, but I woudln't be surprised if it's not working (my bad).