The Bitcoin Puzzle

In this post, I’ll explain how I fell into a deep rabbit hole. I’ll cover how the puzzle works and how the algorithm can be optimized in Python using techniques like parallelization and pre-compiling with Cython. Finally, I’ll reveal how you can increase computation speed by nearly 100x by implementing Metal to fully harness the power of your Apple Silicon. Let’s go!

The Puzzle

It all started with 256 transactions of roughly 33 BTC in January 2015. The first transaction to wallet #1 contained 0.001 BTC, wallet #2 received 0.002 BTC, then 0.003 BTC to wallet #3 and so on — the final wallet #256 received an amount of 0.256 BTC. Back in those days, the 33 BTC were worth about $5,000.

In the following years, additional transactions were made to adjust each wallet’s balance. In 2017, the balances from wallets 161 to 256 were moved to the lower ranges. And in 2023, someone — allegedly the creator — released another 900 BTC to all wallets from #66 upwards, following a similar principle. Wallet #66 is now worth 66 BTC instead of 0.66 BTC, and so on. At this point, the value of the transferred BTC summed up to over $27 million.

The goal of this exercise is to guess the corresponding private key from the known public address of each wallet. You might think this is impossible because it’s based on a hashing algorithm and you can’t just reverse it or brute-force the whole range of possible combinations.

You are right. Basically…

That’s why the creator limited the complexity of each private key. The first private key has an upper range limit of 1. What does that mean? As the private key is basically just a (huge) number, in the case of the first wallet, you just have to test all numbers from 0 to the upper limit of 1. This means calculating the public key (or signing key) and the compressed wallet address, and comparing it with the known wallet address. If there’s a match, you have access to the wallet and can claim your reward.

These are the first 10 items that were solved quickly after the puzzle was released (full list)

Index 0:   Hash: 00[..]00   Address: C 1BgGZ9tcN4[..]Z26SAMH  
Index 1:   Hash: 00[..]03   Address: C 1CUNEBjYrC[..]4wpP326Lb  
Index 2:   Hash: 00[..]07   Address: C 19ZewH8Kk1P[..]EiCjTRaZMZQA  
Index 3:   Hash: 00[..]08   Address: C 1EhqbyUMvvs7[..]D6YKfPqb7e  
Index 4:   Hash: 00[..]15   Address: C 1E6NuFjCi2[..]q84zJeBW3k  
Index 5:   Hash: 00[..]31   Address: C 1PitScNLyp2H[..]fveTnfmpPbfp8  
Index 6:   Hash: 00[..]4c   Address: C 1McVt1vMtCC7[..]yCcLXzueeC  
Index 7:   Hash: 00[..]e0   Address: C 1M92tSqNmQLYw[..]d1ysMBxK  
Index 8:   Hash: 00[..]d3   Address: C 1CQFwcjw1dwht[..]L7ivBonGPV  
Index 9:   Hash: 00[..]202   Address: C 1LeBZP5QCwwgXR[..]PUokyLHqe

So what’s the challenge here? Well, the complexity increases with every wallet by the power of 2. While the first wallet was limited to one private key (2⁰ to 2¹) , the second wallet goes from 2¹ to 2² — 1 (2 possible private keys). Wallet #3’s private key features a range from 4 to 7, and so on. The ranges increase exponentially. The final wallet #160 covers a range from 2¹⁵⁹ to 2¹⁶⁰ — 1, resulting in 7,307,508,186,700,000,000,000,000,000,000,000,000 possible private keys — 73 octillions. Have fun with that. (FYI: The universe is estimated to have about 10⁸⁰ atoms.)

At the lower levels, this is quite a simple task that takes just seconds to solve. And that’s why the first 66 private keys have already been found, as well as some outliers in the upper ranges.

Trivia

First of all, it is not known who actually created this task, puzzle, or riddle — whatever you want to call it.

Second, there was a lot of volume going through this wallet: tens of thousands of transactions with millions of BTC. The last transaction took place in 2017, ignoring the winner claiming their rewards.

Third, this has created a rabbit hole for some people, like me. In the chase for the big prize, they are not only throwing hardware into the ring but also knowledge and inventive minds. One of the “official threads” has dozens of pages where people share their findings; they built algorithms to improve the process further and further.

And fourth, this challenge reveals a drawback of the Bitcoin transaction system. A couple of days ago, someone found the private key for wallet #66. He or she tried to withdraw the prize from the wallet. An attacker who was watching the chain for transactions was able to “hijack” this transaction and steal the funds.

Bitcoins double spending attack

To understand how this attack works, you need to understand how the puzzle and Bitcoin transactions work. Each transaction requires three important components to identify the recipient: A private key, a verifying (signing) key and a wallet address.

By performing elliptic curve multiplication (SECP256k1), you can derive the verifying key from the private key Next, you apply several hashing operations to obtain the public wallet address:

  • SHA-256
  • RIPEMD-160
  • 2 x SHA-256
  • and finally a (simple) Binary-58 encoding

required computation steps

The ultimate goal is to find the private key for a known wallet address. As you can see, obtaining the verifying key requires only one operation, while calculating the wallet address is the more complex part. Solving this puzzle requires you to follow all the steps.

If you find a valid private key for a given wallet address, you can initiate a transaction signed with the now-known verifying key. You need to submit both the wallet address and the verifying key to get confirmation from the blockchain.

Typically, confirmation takes around 10 minutes — that’s how Bitcoin works. An attacker can monitor the blockchain for transaction requests that contain one of the 160 known wallet addresses from the puzzle. Once a request appears, the attacker obtains the verifying key from it, and the clock starts ticking. You’ve already done the hard work of brute-forcing the wallet address from the verifying key. The attacker only needs to go through all possible private keys — which is a limited range in this puzzle — and perform the relatively straightforward SECP256k1 operation to find the matching key (although it’s not as simple as it sounds).

If the attacker succeeds within the 10 minutes, they can submit their own transaction request with a much higher transaction fee. Due to the nature of Bitcoin, miners will prioritize the higher fee and confirm the attacker’s request.

Keep in mind that this method does not work universally for all transactions on the blockchain, as private keys are typically much more complex. Nonetheless, it highlights a significant drawback in the process.

Enough talk — let’s dive into the race and participate.

The algorithm

First, we will create a public (or signing) key from the private key using the elliptic curve algorithm SECP256k1.

eliptic curve

Superficial Note on the Algorithm: SEC stands for Standards for Efficient Cryptography, P represents the prime number 2²⁵⁶ - 2³² - 977 on which this curve is based, 256 denotes the length of the private key in Bit and k1 refers to the Koblitz curve, a specific type of elliptic curve. The actual curve is defined by the equation:

y²=x³ + 7

Of course there’s a Python Library for that:

private_key_hex = f’{1:064x}’
private_key_bytes = binascii.unhexlify(private_key_hex)
public_key = ecdsa.SigningKey.from_string(private_key_bytes, curve=ecdsa.SECP256k1)

As mentioned earlier, we need to compress the public key. The details are not important right now; just trust me, this step brings us closer to our goal.

verifying_key = public_key.verifying_key
x = vk.to_string()[:32]
y = vk.to_string()[32:]

y_int = int(binascii.hexlify(y), 16)

if y_int % 2 == 0:
    compressed_public_key = b'\x02' + x
else:
    compressed_public_key = b'\x03' + x

compressed_public_key binascii.hexlify(compressed_public_key).decode()

Finally, we perform SHA-256 followed by RIPEMD-160 to calculate the wallet address. If you’re interested in how SHA-256 works, I’ve written two extensive posts about it!

compressed_public_key_bytes = binascii.unhexlify(compressed_public_key_hex)

sha256_hash = hashlib.sha256(compressed_public_key_bytes).digest()

ripemd160 = hashlib.new('ripemd160')

ripemd160.update(sha256_hash)  

ripemd160_hash = ripemd160.digest()

versioned_payload = b'\x00' + ripemd160_hash

bitcoin_address = base58check_encode(versioned_payload)

I’ve compiled everything into a Jupyter notebook, which you can find on GitHub. Let’s see how it performs for the 20th wallet. This will test 524,288 possible numbers, ranging from 2¹⁹ to 2²⁰-1:

target_address = '15JhYXn6Mx3oF4Y7PcTAv2wVVAuCFFQNiP'  

measure_performance(19, 20, target_address)

On my Apple M3 with 36 GB RAM it took about 30 seconds:

Private Key (Decimal): 863317 
Private Key (Hex): 00000000000000000000000000000000000000000000000000000000000d2c55 
Compressed Public Key: 033c4a45cbd643ff97d77f41ea37e843648d50fd894b864b0d52febc62f6454f7c 
Bitcoin Address (Base58Check): 1HsMJxNiV7TLxmoF6uJNkydxPFDog4NQum 
Total time taken: 24.530259 seconds 
Average time per iteration: 0.047 ms
Initial CPU Usage: 29.7% 
Final CPU Usage: 28.5% 
Initial Memory Usage: 162.12 MB 
Final Memory Usage: 162.25 MB

This looks like a great starting point for some fine-tuning

Parallel Threads

(rel. bitcoinPuzzle-v1.py)

First, let’s move from Jupyter to a Python file and try to implement parallel threads. To effectively use parallel computing, we need to determine how many units we can actually utilize. We can leverage some built-in Python functions to find this out (also see section Measure available system resources in this Notebook):

multiprocessing.cpu_count()

This returns 12, so that’s the number of units we’ll use. Implementing concurrent processes in Python is quite straightforward. You can use either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, which serves as a wrapper class.

  with concurrent.futures.ProcessPoolExecutor(max_workers=12) as executor:

  # Part 1 - starting parallel threads for generate_key()
  
  future_to_index = {
  
  executor.submit(generate_key, start_at, target_address): start_at
  
  for start_at in range(min_val, max_val + 1)}
  
    last_reported_percent = 0
  
  # Part 2 - watch each running thread for result
  
  for future in concurrent.futures.as_completed(future_to_index):
  
    if found: break
  
    result = future.result()
  
    if result: found = result; break
      current_percent = (i + 1) * 100 // n
  
  print('Done')

The idea is simple: at some point, we start a few threads and monitor the output of each process.

Try running it, and you’ll notice that it takes quite a long time. In our case, this approach isn’t very efficient. The generate_key function is relatively fast, but it gets called twelve times, resulting in significant overhead from thread management. If generate_key were a heavier function, we could maintain this structure. However, in this scenario, we need to divide the search range into batches.

Batching

(rel. bitcoinPuzzle-v2.py)

To clarify things a bit, we can create a wrapper function called batch_key_generation.

def batch_key_generation(start, end, target_address):

  found_key = None
  
  for i in range(start, end):

    res = generate_key(i, target_address)
    
    if res:
    
      found_key = res
      
      break
    
    return found_key

The optimal batch size is determined by the number of threads. We could use a bigger value, but then the last thread would have less data to work with. And if the batch size is to small, we’d had to launch more threads than availbl processing units. Which agains creates to much overhead.

batch_size = n // num_workers # Assuming you want 12 workers  

# [...]

future_to_index = {

  executor.submit(

    batch_key_generation, 

    start, 

    min(start + batch_size, max_val + 1), 

    target_address): start for start in range(min_val, max_val + 1, batch_size)

}

Now, the execution time has decreased dramatically to just 5 seconds!

Processed 524289 out of 524288 keys (100%)...
Processed 1048569 out of 524288 keys (199%)...
Total time taken: 5.517847 seconds
Average time per iteration: 0.011 ms
Initial CPU Usage: 40.0%
Final CPU Usage: 96.8%
Initial Memory Usage: 15.20 MB
Final Memory Usage: 16.86 MB

Great! What’s next?

Stop-At-Succes

(rel. bitcoinPuzzle-v3.py)

Did you notice the slight delay between the success confirmation and the actual termination of the script? This occurs because the pool manager waits for all other threads to finish. We need a global flag to monitor the success state. Here’s how to do it:

First, we wrap our thread loop in a Manager, which acts as a global watchdog for shared state. We also provide a lock and our global state, found_flag, to our calculation function:

with Manager() as manager:

found_flag = manager.Value('i', False)

lock = manager.Lock()

concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:

future_to_batch = {

  executor.submit(

    batch_key_generation,
    start, 
    min(start + batch_size, max_val + 1), 
    target_address, 
    found_flag, 
    lock): start for start in range(min_val, max_val + 1, batch_size) }  

[…]

Of course, we should check this global variable in our wrapper function batch_key_generation so that it can stop iterating when needed.

def batch_key_generation(start, end, target_address, found_flag, lock): 
  found_key = None 

  for i in range(start, end):

    if found_flag.value: # check if it's been set from anywhere else

      return None # and then exit early, if required

    res = generate_key(i, target_address) 

    if res: 

      with lock: 

        if not found_flag.value: # double-check

          found found_flag.value = True # and tell all other process, that we found a key

          found_key = res 

          break 

return found_key  

There’s an important drawback: accessing the lock is a “time-consuming” operation. You will notice a slower processing time when using the lock, which can increase the total execution time from 5 seconds to 11 seconds. You need to decide whether you want a quick result on the command line while the threads are still running or if you’re willing to add this extra time to ensure that all threads finish by the end of the script.

Cython — pre compile our code

(rel. bitcoinPuzzle-v4.pyx and bitcoinPuzzle-v4-setup.py)

There’s another trick to speed up processing: precompiling the Python code. Don’t worry, it sounds more complicated than it actually is. We just need to rename the extension of our Python script to .pyx and create a setup.py file like this:

from setuptools import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("bitcoinPuzzle-v4.pyx"))

Now we compile the file into a binary library (you may ignore the warnings):

python bitcoinPuzzle-v4_setup.py build_ext --inplace

After that, we can either run our script from the command line or use it in any other script:

python -c "import bitcoinPuzzle_v4; bitcoinPuzzle-v4.measure_performance(19, 20, '1HsMJxNiV7TLxmoF6uJNkydxPFDog4NQum', 12)"

It may not seem significantly faster, likely due to the overhead. However, if you try larger ranges, you will definitely notice a difference. Also, remember that we implemented the global flag to exit the batch processing prematurely.

Random batches

(rel. bitcoinPuzzle_v5.py)

Well, this approach is more of a hack than a true improvement. The idea is simple: as the range increases, processing all possible variations takes more time. However, since we are searching for a needle in a vast field of haystacks, why should we start iterating from the bottom to the top like everyone else? Instead, we could pick random locations, hoping to achieve a quicker win than our “opponents”!

The implementation is straightforward. First, we create a list containing all starting points based on the chunk size, and then we shuffle it.

start_points = list(range(min_val, max_val + 1, chunk_size)) 

random.shuffle(start_points)

As this is a hack that cannot be measured in time, I’m right jumping to the final improvement:

What about Pytorch?

(rel: bitcoinPuzzle-v5.ipynb)

First, we need to identify the most expensive operations in this puzzle and then try to improve it further. Let’s determine how long it takes to run 10,000 private keys through the following processes:

SECP256k1): 2.37 seconds
3 x SHA-256): 0.02 seconds
RIPEMD160): 0.01 seconds
BASE58: 0.09 seconds

The elliptic curve operation is clearly the bottleneck. Currently, we are only utilizing our CPU without taking advantage of the full power of more sophisticated processing units like a GPU or MPS. PyTorch could assist with that, but it doesn’t natively support SECP256k1. Therefore, we need to find another solution, and this is called:

Metal!

(rel: computation.metal)

Metal is a high-performance graphics and computing API created by Apple that allows developers to harness the full power of the GPU for tasks beyond just rendering graphics, such as machine learning and data processing. It provides low-level access to the GPU, enabling more efficient parallel processing and better performance for applications on iOS and macOS devices.

Spoiler Alert: I wasn’t able to fully implement the working SECP256k1 algorithm in Metal, primarily because the elliptic curve algorithm requires 256-bit integers, which are not natively supported in Metal. This limitation necessitates significant adjustments. However, the performance gains are impressive, and I’m actively working on it!

For demonstration purposes, I chose a mathematical operation that can be implemented in PyTorch, pure Python, and Metal: the cosine!

In Metal we simply create a so called kernel function:

kernel void benchmark(device const float *inputData [[buffer(0)]],
                      device float *outputData [[buffer(1)]],
                      uint id [[thread_position_in_grid]]) {
    
    // Get the input data for the current thread
    float inputValue = inputData[id];
    
    // Compute the cosine of the input value
    float result = cos(inputValue);
    
    // Store the result of the computation in the output buffer
    outputData[id] = result; // Store the computed cosine
    
}

We compile it using Xcode’s command line tools:

xcrun -sdk macosx metal -fcikernel -c computation.metal -o computation.air
xcrun -sdk macosx metallib computation.air -o computation.metallib

This command will compile the source code into a Metal library. The most important flag here is -fcikernel, which is necessary to enable the kernel code to run on the GPU.

Here’s the complicated part: we need a wrapper in Swift to bridge the Metal source to Python later on.

First we initiate our metal device, the GPU:

guard let device = MTLCreateSystemDefaultDevice() else {
    print("Error: Unable to create Metal device.")
    return 0
}

Now we reference the benchmark function from the external library we just compiled:

let metallib =  "\(#file.replacingOccurrences(of: "/wrapper.swift", with: ""))/../computation.metallib"

guard let library = try? device.makeLibrary(filepath: metallib) else {
    print("Error: Unable to create Metal library from \(metallib).")
    return 0
}

guard let kernelFunction = library.makeFunction(name: "benchmark") else {
    print("Error: Unable to load benchmark function.")
    return 0
}

After that we initiate our command queue as well es input and output buffers:

// Create a command queue
guard let commandQueue = device.makeCommandQueue() else {
    print("Error: Unable to create command queue.")
    return 0
}

// Create a command buffer and encoder
guard let commandBuffer = commandQueue.makeCommandBuffer(),
      let encoder = commandBuffer.makeComputeCommandEncoder() else {
    print("Error: Unable to create command buffer or encoder.")
    return 0
}

// Create the compute pipeline state
let computePipelineState = try device.makeComputePipelineState(function: kernelFunction)

// Set the compute pipeline state
encoder.setComputePipelineState(computePipelineState)

// Create input buffer for GPU
let inputByteLength = numItems * MemoryLayout<Float>.size
let inputBuffer = device.makeBuffer(bytes: inputData, length: inputByteLength, options: [])
encoder.setBuffer(inputBuffer, offset: 0, index: 0)

// Create output buffer for GPU
let outVectorBuffer = device.makeBuffer(length: inputByteLength, options: .storageModeShared)
encoder.setBuffer(outVectorBuffer, offset: 0, index: 1)<

And as we also want to utilize parallel computing, we need to define how many parallel threads we want to run:

let threadsPerGroup = MTLSize(width: 64, height: 1, depth: 1)
let numThreadgroups = MTLSize(width: (numItems + threadsPerGroup.width - 1) / threadsPerGroup.width, height: 1, depth: 1)
encoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)
# tell the encoder, that we are "done" with our setup
encoder.endEncoding()

And this is how we run the process:

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Copy the GPU results to the output buffer passed from Python
if let outputPointer = outVectorBuffer?.contents().assumingMemoryBound(to: Float.self) {
    outputData.update(from: outputPointer, count: numItems)
} else {
    print("Error: Unable to get contents from output buffer.")
    return 1
}

(Note: I also included a Swift wrapper and some additional methods in the corresponding GitHub repository to allow for direct execution from the command line.)

Now we need to build our swift library (I am not going into the details about the Swift package configuration, check out the repo for an example):

swift build

This will give us a couple of compiled files. The most important one is (usually) hidden in .build/debug: The dynamic library file libWrapper.dylib. Before we can pull it into your Python code, we have to sign it:

// create a new private key
openssl genrsa -out developer.key 2048 
// create a signing certificate
openssl req -new -x509 -key developer.key -out developer.crt -days 365 -subj "/CN=developer/O=NickyReinert/C=DE"
// find the cert's ID
security find-identity -p codesigning
// sign the library
codesign -s "YOUR_CERT_ID" --deep --force .build/debug/libWrapper.dylib

In Python we load the library simply like that:

swift_function = ctypes.CDLL(".build/debug/libWrapper.dylib")

Now we can already reference our benchmarkin function. As we are working with pointers to the output buffer, it’s a little more efforts to get the results:

swift_function.benchmark.argtypes = [
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int
]

# create some float values
numItems = 1_000_000_000
input_array = np.linspace(0, numItems - 1, 10, dtype="float32")  

# convert array to pointer
input_ptr = input_array.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

# create output pointer
output_length = len(input_array)
output_mutable_ptr = (ctypes.c_float * output_length)()  # Create an output buffer

swift_function.benchmark(input_ptr, output_mutable_ptr, output_length)

output_array = np.ctypeslib.as_array(output_mutable_ptr)

combined_results = np.column_stack((input_array, output_array))

print(combined_results)

Finally we run the script as any other Python script:

python computationGPU.py

Our Metal implementation completed in just 0.06 seconds (be sure to run it several times to “warm up” your system) for 1,000,000,000 items! The pure Python implementation took about 7 seconds, while PyTorch finished in 4.7 seconds. We’ve improved the timing by a factor of 75!

Congratulations! You’ve just learned how to harness the full power of your GPU. Welcome to the bottom of my rabbit hole. Next stop: implement SECP256k1 and take part in the Bitcoin Puzzle Rally!

~

(Final Note: You can also use PyOpenCL to initialize the device and command queue from within Python, which may simplify and enhance the process. I preferred the Swift wrapper because it is easier to debug.)