Hari Recipes

Welcome to The Internet

The internet sucks. I think everyone feels the same level of disgust when going online in 2024 and seeing the enormous volumes of content, most of which is garbage, inaccessible, inconvenient, or locked behind walled gardens.

The modern internet is an “ecosystem” in the same way that a landfill is an ecosystem. We are like seagulls, picking through mountains of trash looking for slivers of useful information, while around us, bulldozers are piling up even more trash the whole time. My hope is to build things that are the opposite– useful and simple.

Hari Recipes is a web app I built in <80 person-hours of work. It is a simple app that provides quick, (somewhat) accurate semantic search over a database of 300,000 recipes scraped from the internet. It is designed to be self-hosted, or hosted on small virtual machines with little to no configuration. It is also designed to be hackable (stupidly simple). Each part of the app, from the code used to crawl the recipe websites, to the code for semantic search, is able to be played with and run in isolation.

The goal here is to disentagle you dear reader, and anyone else who you know that likes cooking, from every recipe website. Starve the beast by not playing any ads or giving them any traffic. Host it yourself and use it on your wifi, or host it in the cloud or a RaspberryPi and share the link with your mom so she can use it.

Here is an actual example of me looking for a sweet tea recipe on the internet on my phone.
Here’s what the networking tab looks like for that page: Who does this help? What problem does this solve? How many hours of human labor and kilowatt-hours of energy were spent making this shit, which is not only useless, but actively antagonistic to users? Cancer is the only word I can think of to describe this. Cells that have forgotten their original purpose and now just exist to propagate themselves.

Here is a query for sweet tea on hari.recipes

I am not a very clever person. But even the most mediocre developer can solve real problems. I believe hari.recipes is solving a real problem, and it took very little of my time to make.

The rest of this post is about how I built the project. But the main takeaway I would like for everyone to have is that we can starve the beast, even just a little bit at a time. If everyone solves small problems and share those solutions with their friends, maybe we can reclaim a little bit of the “good internet”.

Here are some websites that actually provide value, which I took inspiration from:

Round House Overalls
RockAuto
Maestro Spanish

Note:

Not all recipe websites are as bad as the one I linked. Many of them are very well-curated and run by well-meaning individuals. Practice discretion between content farm recipe websites, and ones like this

Ideation

Earlier this year I migrated all of my personal notes off of Notion and onto a github repo that uses Obsidian as a markdown renderer. Migrating away from Notion had some issues which I speculate might be a form of vendor lock-in, but eventually I got it done. I started using Obsidian for note-taking very frequently, and started using it a lot for saving recipes.

Note:

Don’t trust these people with your stuff

I noticed that manual curation of recipes was actually far superior to just bookmarking them in the browser. I could correct errors in the original recipe, add clarifications, substitutions, etc. I think that the difference is one of ownership. I figured that this pattern of copying down recipes into markdown and saving them locally could be automated, and so the project began.

Note:

Ben Awad (a very funny software dev/content creator) already had the same idea about recipe curation and ownership. However, it is a paid app that requires an account. Unfortunately I consider this solution to be equally as contrived as the problem it solves (no offense to Ben, you gotta make a bag somehow, I respect the hustle). But, I feel like I would be taking undue credit for the idea if I didn’t mention it here.

https://www.mysaffronapp.com/

Finding Recipes

Finding the recipe scrapers Python package meant that the project was not only going to be doable, but a piece of cake. The hard part was already solved for me.

Most recipes are stored in a script tag with the type application/ld+json. There are a couple of issues here for our webcrawler.

To tell if a page contains a recipe, we have to scrape the page for every script tag, then check if the script is an ld+json. This is slow.
Some pages have multiple ld+json scripts! How do we know which one is a recipe?
Some websites, like grouprecipes.com dont follow this convention at all.

The solution I came up with was just to start with a basic spider (spider.py). All this does is a basic depth-first search of every link on a website that belongs to the same domain.

To solve problem #1, we can do a little investigation for each website by hand and see if there is a url pattern that designates a reicpe. In our websites.json file, I store this like:

    "recipe_prefix": "https://www.allrecipes.com/recipe"

Problems #2 and #3 require solutions on a case-by-case basis. I just had to keep adding checks and refactoring the spider to account for different inconsistencies between websites as I went.

Because crawling the websites is a long-running process prone to interruptions, I set a checkpoint that dumps the entire spider object with internal state to a pickle file. This way we can resume scraping for every 5000 items. The code in this section is pretty bad, but it gets the job done.

Once all of the recipe urls are saved into a file called all_recipes.csv we can scrape them with a simple script using recipe-scrapers.

Cleanup

The initial data is really bad. Out of ~500k recipes, only ~340k will be useful.

The simplest type of cleanup we can do is determine the minimum amount of information required for a recipe to be valid. I settled on:

title
source url
ingredients list
instructions list

If a recipe does not have None for any of these fields, it will be parsed into a Pydantic model and move on to the curated dataset.

There is a lot of junk in the dataset as well. Once I got the data searchable I tried out some queries of different terms like “Bitcoin” and, well…

To determine if a recipe was spam or not, I ran an LLM locally using LM Studio. I have never tried self-hosting LLMs, but I was amazed by how easy it was to run a local inference server. I used hermes-3-llama-3.1-8b which fit on my RTX 3070 and runs at about 0.5s/inference. It took about 48 hours to process every single recipe, using 14.4kWh, equivalent to ~1 gallon of gasoline worth of CO2.

Vector Search

This project probably would not have been possible 2 years ago. At least, not for a similar amount of effort. A Python library called SentenceTransformers lets you run pre-trained BERT models on a corpus of text. Getting search on my recipes is as simple as writing a function to turn RecipeData into a string, and then feeding those strings into a model. The resulting tensor is just a list of vector embeddings of each recipe string. Here is what search looks like after the embeddings are generated:

Why did I use a non-food example query for all of these images?

It takes 45s to load the recipes and full-precision vector embeddings, 4GB RAM to hold it all in-memory, and 790ms to retreive 20 results. This is not ideal performance, but it provides a decent starting point for a self-hosted app.

Optimizations

Sqlite DB

The first thing we can do is stop storing our recipes in memory. Uncompressed our recipes take up almost 1GB. We can store all the recipes in a dead-simple sqlite table. Column 1 will be the recipe id, column 2 will be a JSON of the dumped Pydantic model.

The result is less than 100 lines of code I am using the repository pattern to abstract away the details of the data storage. This way hackers can choose if they want to store their recipes in-memory in json or on-disk in sqlite. All they need to do is switch the class they use.

Vector Precision

Our vectors are made up of 384 32-bit numbers by default. This post was at the top of Hackernews the same day I was working on the vector embeddings. According to this post, you can get really good similarity scores even just binary-precision vectors. So instead of [float32, float32] we can use [0, 1] and have similar performance (~96-97% accuracy).

With binary precision, our vector embeddings are 30x smaller, and comparisons run about 100x faster. Our search quality is degraded, but as long as we give the user the ability to turn on higher-precision when they are self-hosting, I don’t mind the lower performance. My goal right now is to allow for hosting even on smaller hardware.

API

app.py

We only have 4 routes.

”/” is the default which just shows the search bar
“/recipe_query/?query=&num_items=” shows a list of items for a given query
“/recipe/?index=” displays a recipe
“/recipe/json/?index=” displays the raw json of a recipe

Templating

Our templates are also stupidly simple. 3 templates, the base template, the query_results.html template for displaying a list of recipes, and the recipe_detail.html template for rendering a single recipe. There is not a single line of Javascript, and about 14 lines of in-line CSS ripped from even better *** website

Deployment

I use Caddy to handle https for me. I am not gonna lie, I have no idea how Caddy works, I copied this template. As soon as I am done writing this blog post I am going to educate myself about how Caddy works but right now I just want to get this shit hosted.

Note:

Google migrated all of my domains to Squarespace earlier this year. Fine, whatever, this is the state of tech in 2024. I initially bought the domain hari-recipes.com on Squarespace, just to keep all my domains in one place. Bad idea. Squarespace was not updating my A record to DNS servers. It would not let me use custom nameservers, so I could configure my records elsewhere and it would not let me unlock the domain to move it to another provider. We’ll see if Mastercard lets me chargeback for this shit because they are quite literally stealing my money by not allowing me to do anything with the domain I paid for. hari.recipes cost me $6 on namecheap, and I was able to set up DNS in AWS Route 53 in about 30 seconds. I have no love for Amazon, but at least their services work.

The current app is running on DigitalOcean’s $6 VM. I tried the $4 option, but 512MB of RAM was just not enough to build all of the Python dependencies.

Load Tests

Thanks to all of our optimization and the simplicity of our app, we should be able to serve a pretty decent number of concurrent users on a small VM. I ran load tests using Locust on my PC. I figure that the relationship between performance on my desktop and the cloud VM should be a linear one.

So the ratio of max concurrent users on my desktop / max concurrent users on the VM, should be the same as query duration on my desktop / query duration on the VM

We can see that on my PC, the app can handle about 20 RPS before the 95th percentile response time starts to spike. I didn’t push it as far as it could go, because really I don’t expect many people to use this app, but 300ms resonse is fine for a recipe query.

A query with 250 matches runs in about 0.01 to 0.02s on my machine.

On the VM, the same query runs in 0.04 to 0.08s, so 4x slower.

Based off of my load tests, the VM should be able to handle at least 10 (simulated) concurrent users before response times get up above 300ms. Each of those simulated concurrent users are making query requests once per second. So in reality, a single $6 VM can probably handle significantly more real human users.

Results!

I claimed that this project was built to solve a real problem. Here is my evidence for that claim:

The problem has been solved, the jalapenos have been candied. There are no frameworks to hop between, I can move on with my life.

Full Stack Footguns

This blog post won’t be very interesting to anyone who knows what they are doing, but it’s helpful for me to document my own progress. Back in March 2023 I put together an app called ai-arena the idea is that you can write some code in the browser to control units in a RTS-style game, and then upload that code to a server where the code competes 24/7 online for control of a galaxy.

I got the idea from my AI class in school where we could upload our code to a competitive ladder for extra credit.

A screenshot of “Space Settlers” from CS 4013 at OU

Screenshot of my game written for HTML canvas

This project was exclusively a learning exercise for me. I ended up making:

A Game Engine in canvas to run my game, which could be imported as an ES6 module and run both in a webpage rendered to Canvas, or headless in nodejs for running a simulation
Code sandboxing inside the game engine to prevent user code from manipulating the game, browser, or node process/environment
A Svelte SPA website
An In-browser code editor
A Lambda function for user code santization and and testing
A NodeJS game server for running individual games safely
A NodeJS Galaxy server for simulating the galactic-scale battles
A ThreeJS Galaxy for visualization
A Supabase backend for handling everything database-related

Looking back on it, there was quite a lot of stuff that I ended up needing to do for this project. But what stands out to me the most are the numerous horrible mistakes I made. So in this blog post I am going to go through all of them, and talk about what I would like to change in the future.

The Game Server

The game server is the backend that runs continuously. The first piece is the “Campaign Service” that makes calls to the DB and keeps track of the state of the galaxy, sort of like the world map in the game Civilization. When two players have a conflict, the conditions of the conflict are serialized and sent to a Beanstalk queue.

The next piece is the “Battle Server”. This service just polls the queue for conflicts, then reserves them while it executes a simulation of a battle. Whoever destroys the enemy base within the allowed timespan wins. If the simulation lasts too long, then the winner is determined based on some heuristic. If a player’s code consistently times out or uses up too much memory, they lose by default. The Battle Server posts the results of each battle back to a different channel on the Beanstalk queue.

The Campaign Service occasionally polls the results channel on the Beanstalk queue for results, then updates the DB with these results. The Beanstalk queue is a great layer between the Campaign and Battle services. The campaign service is heavy on networking requests but low on computational overhead. The battle service is pretty much exclusively computational overhead. Having a queue between them allows for multiple battle services to be spun up and consume from the Beanstalk queue continuously.

A diagram for those who hate reading

The problems here lie in the implementation. The logic in the Campaign Service is very convoluted. Between checking the queue, pushing to the queue, and updating the database, there is a lot of looping, stopping and starting, and conditions based on the states of the queue. The state of the Campaign Service is spaghetti’ed all over the place and it’s not obvious to me what it should look like at any given point. This kind of logic was easy to write, but absolutely horrible to read and amend.

Something else I (very foolishly) didn’t anticipate at all were network errors. Timeouts, failures to resolve URLs, I had not written handling for any of those sort of errors. This wasn’t an issue when I was deploying to DigitalOcean droplets, but on my RaspberryPi on my apartment wifi, almost every API call should have incorporated some sort of timeout error handling with backoff and retry.

Lastly, there are a handful of difficult to reproduce errors. Running in Docker makes it even harder to test these errors.

How to fix

State

Every iteration of the loop should look something like this:

    F(S) -> S'

When reading from the game queue it would look like

    F(q, S) -> S'

If we have a nice, centralized state with easy to understand transition functions, the code should be a lot easier to test and maintain. Writing unit tests for the Battle Service and Campaign Service would help me separate the non-networking logic into its own components.

Networking

Implementing a proper service adapter. Currently I just have functions that loosely wrap the Beanstalk and Supabase client libraries. I would add error handling and defined behavior for dealing with connection failures to both of these services. As well as making them async so that they aren’t blocking the rest of the application when they hang.

Switching from my hard dependency on the Beanstalkd and Supabase clients to an Adapter layer would let me switch to a dependency-injection model, which would allow me to do mock testing of both services more easily.

Third, integration tests. This part I am not really sure how I would go about it. Like I said, it’s a bit hard to parse through complex logic like this, but maybe reducing the scale of the galaxy to 10-20 stars instead of several thousand, and then walking through the simulation by hand, and hard-coding those states as my test cases.

Finally, logging. Some sort of logging whenever things crash. This would help me diagnose what causes crashes during the long-running games.

Writing all of this down makes me feel pretty embarrassed about just how awful my practices were for this project, but doing things the wrong way helped me really internalize the message of the SOLID and TDD resources that I read later on.

The Website

The website is very, very ugly. I wanted to challenge myself and write all of the CSS myself. I don’t think this was a good idea. Not just because I am bad at CSS, I think challenging yourself to improve is a noble effort, but because I spent most of my time focusing on the CSS and wrangling with it, and let my components suffer for it.

The components should have taken the center stage and been the focus. What are the potential states the components can be in? How do they handle that state. Can they be unit tested for these different states?

There is a lot of ambiguity to the user interaction too. I had a couple of experienced devs try out the website and flat out tell me they had no idea what was going on. This is never a good sign. I should have first made sure that interaction flow made sense and I had states for every possible outcome of a user/networking interaction before I even thought about adding functionality.

Some of the API code is… terrible. A veritable fettucini alfredo of chained async functions. This is partially because of laziness and refusal to write SQL queries, and partially because of laziness and refusal to use await.

How to fix

Focus on the components first. Having components driven by data would have saved me quite a bite of code reuse for things like popup modals and labels. Unfortunately I was only recommended the great book Refactoring UI recently. Also this. Make sure the components have unit tests and interaction flow that makes sense.

Use styling generated by a professional. The Skeleton UI toolkit was recently recommended to me. In the future I would probably build a prototype of a single page of the website with something like this, then worry about adding in routing and networking later.

Focus on networking LAST. Mock the database or API or whatever. Start with UNIT TESTS, then do MOCK TESTS. Not to test features, but just to test that interaction flow.

Supabase

I don’t really have much to say about Supabase. I should have deployed a local version of Supabase for testing and development purposes. I was actually working on this but got bored and moved on from the project after a while. Having a docker-compose that runs the supabase container + all my backend services would make it possible to do integration tests with Github actions I think. To support this I should make that adapter layer I mentioned earlier, which would provide methods for some of the specific interactions the Campaign Service requires, like starting a new “War” and populating the new galaxy with a bunch of stars.

It would also help to write a repository layer that I could use instead of Supabase, for mock testing.

The game itself

Since finishing this project, I found out about oort.rs which is a competitive coding game where users write code to control spaceships. It is extremely polished, fast, and safe (thanks to Rust). The game runs in wasm on the browser and the backend.

Initially I thought Javascript would be easier for the browser, but it is just too difficult to write and test Javascript for a game like this. It’s too easy to write code that doesn’t have any errors until the 9000th tick.

I’d like to rewrite the entire game in Rust, same as oort. I think I would keep gameplay the same, since there’s no way I can compete with the features in Oort. Instead I’d like to add more functionality to the battle map.

In addition to Rust, I think I could leverage a few things to improve the game.

ECS
Thinking about the state functionally
Decoupling the renderer from the state

ECS might be overkill for a simple 2D physics game like this, but the faster I can simulate games on the backend, the more players I can support and the larger galaxies I can run. Right now the whole game server backend is running on a single raspberry pi in my bookshelf. It takes about 15s to simulate a single battle for 4500 ticks, which is only 2.5 minutes of game time for a single battle.

I hope that Rust will increase performance as well, but I don’t know if I should really expect performance gains from wasm over the V8 engine.

As for the game loop, I think it would make sense to have all the arrays for ECS comprise of a single state object that you can serialize and deserialize, pass into a renderer, throw into some function that computes a winner, etc.

The game loop should look like this:

initalize(Params) -> S
while not HasWinner(S):
    update(S) -> S'
    run_user_code(S') -> S''
    execute_user_commands(S') -> S'''
    render(S''')

Rust also gives the huge benefit of static type checking, so it should be really difficult to upload unsafe code to the game server.

Conclusion

I was pretty proud of myself when I finished this project after about a month of continuous work, but now looking at it I feel pretty embarrassed. It obviously wasn’t a waste of time, I learned quite a bit and most importantly it helped me contextualize a lot of the “programming best practices” stuff that I read afterwards.

I can’t get the idea of a persistent online galaxy-scale war between code bots out of my head, so I will probably return to this project again at some point in the future, hopefully with this post as a good reference point for me to not make all of the same mistakes again.

Fun with Quadtrees

I think Quadtrees are cool. Trees are definitely my favorite data structure, hands-down. There is something about Log efficiency that makes them amazing. Tries, Quadtrees, Octrees, are great and simple! I have yet to implement a K-D Tree or a BSP Tree, mostly because I haven’t had a reason to.

The main appeal of trees to me, and I hope to everyone else, is their ability to give the illusion of massive size. In a game like Kerbal Space Program you can see the ground in relatively high detail when you are walking on the surface, and also see the entire planet from space. You can land on the other side of the planet and find equally as much detail! How much memory does that planet take up if the whole thing is visible at once then? Most people should already know the answer to this question, it’s called LOD. Further objects are rendered in less detail than close ones. But no matter how much more I learn about LOD and tree-based systems for spatial partitioning, the magic never goes away.

Here is an example of a Minecraft mod that adds LOD-based rendering as a skirt around the main area rendered by the game.

It blows my mind just looking at it. It gives computers this illusion of infinite power, the ability to hold infinite worlds inside of a machine.

Anyways, about a year ago I had a project idea that required me to learn several things at once. I wanted to make a 3D map of the game Foxhole and host it on the web to show off to people in the official Foxhole Discord and subreddit. I ended up learning Three.js, slippy tiles, and how to write a quadtree!

I had some screenshots on my phone and I remembered them other day. I thought they were pretty so I decided to make a blog post with them.

The quadtree implementation itself (oh God, I just looked at it, this is horrible and unreadable) is dead simple, but it can be pretty challenging to think through the algorithm and figure out what the program should be doing at a given point. Lots of looping and recursive programs have this issue. At the beginning there were lots of issues that I couldn’t be sure if they were occuring in my quadtree implementation, my tiling program I wrote in Python, or an issue with my coordinate system in Three.js

A weird issue with offsets

A single map region rendering

Strange issue with only the the bottom row of tiles being rendered, but at varying LODs

Now the whole map is rendering, but also a bunch of “parallel universes”

The random colors are just for contrast, but I think they look really cool. It looks like some sort of board game or kaleidoscope– and vaguely reminds me of the way land parcels are laid out in a grid in the US.

The developers really did an amazing job with this map. If you look closely you can see that the hexagon in the center is slightly higher than all the rest. I believe this was the first hex they created, and had to fit into the normalized range of the other surrounding tiles, so it was adjusted.

I took this image because I thought the specular reflection on the glossy water looked really cool in contrast to the softly-shaded and squishy-looking land. Something about this style of render is really peaceful.

The mesh is generated from a heightmap that someone extracted from the game. The heightmap and texture are just served from a URL, using github as a CDN and organizing the tiles in folders. This is how slippy tiles work for mapbox or Google maps. When a node in the quadtree is active, it uses the xyz position to request a texture and heightmap file from a URL, and uses that to generate a textured mesh at a given offset. Pretty simple.

The actual deployed site uses low-res textures and a lower fidelity mesh and a stolen water shader and sky shader. I wanted it to be able to run on any phone and computer anywhere in the world without much trouble. Unfortunately this has the effect of turning the map into a featureless blob when you zoom out too far.

Here’s a link to the final project

Pose Estimation from Aruco Fiducials in OpenCV

I have been working through the SLAM book and also working on building a robot from scratch with some cheap parts as I go. My goal is to learn all the theory behind Monocular Visual SLAM and implement it on my robot.

I started off doing localization from fiducials, which was a concept I got exposed to during my brief stint at Amazon Robotics. The core idea is this: you have some marker, with an ID, that you place in the world. You use some fancy algorithm to detect the marker. Then you can figure out its ID and the pixel coordinates of its four corners. You use the ID to grab the known coordinates of the four corners in 3D. If the camera parameters are known (the pinhole camera matrix, K), then you can figure out the position and rotation of the camera.

In this blog post I want to walk you through that whole process. I will gloss over some of the theory but provide links to the resources you can use if you want to understand the fundamentals. Let’s get started with some of the fundamental concepts.

Requirements

OpenCV installation and OpenCV with Python
Access to a Printer
Basic knowledge of Linear Algebra
Know what the word “optimization” means

The Camera Matrix

The first step is going to be “calibrating the camera”. Hopefully this image should clarify the problem a bit:

The intrinsic parameters are what we aim to find in this step. The extrinsic matrix describes the rotation and translation that converts a 3D point from world coordinates to camera coordinates. The intrinsic matrix describes a transformation from 3D camera coordinates, into 2D coordinates of the image.

Helpful Tip: If you aren’t familiar with Homogeneous Transform Matrices I recommend checking out chapter 3 of Modern Robotics

The way to do this is to take a lot of pictures of a 3D object with lots of different known points. Given the orientation of those points relative to each other, we can figure out the position of the 3D object relative to our camera. From there it becomes an optimization problem to figure out the intrinsic camera parameters that map the 3D points to 2D. Using a checkerboard pattern is the easiest way to have the relative spacing and orientation of the 3D points be known.

If that sounds like a lot of unpleasant math, fear not! OpenCV provides an out of the box solution to calibrating the camera. You can follow the steps below fairly mindlessly and get this piece of puzzle done.

No-Fear Camera Calibration

Download a calibration image, I use this one
Print off your calibration image. The scale of the squares is important, you can measure the size of the squares with a pair of calipers or something. You’ll want to know their size very precisely in mm.
Copy the following code from GeeksForGeeks
Update this line of code:
```
 objp[0,:,:2] = np.mgrid[0:CHECKERBOARD[0], 0:CHECKERBOARD[1]].T.reshape(-1, 2) * SQUARE_WIDTH
```
Where SQUARE_WIDTH is the measured value from step 2. This gives your camera matrix a sense of scale. Be aware that all of our measurements are going to be in mm in this tutorial.
Take at least 14 pictures of the checkerboard with the camera you want to calibrate. I usually do about 25.
Run the script on your images. Images with the corners of the checkerboard hilighted will pop up. You want the corner detections to be accurate. You should delete any images that don’t have corners detected, or have them detected improperly.
Congratulations! If all things work properly you should see the Camera Matrix and Distortion coefficients. You can save these off to a json and load them later.

Helpful Tip: Take your calibration images at the same aspect ratio and resolution as the images you will take for localization. I ended up burning myself because I was capturing my calibration images for Raspberry Pi camera without using the video port, but was using the video port to stream my images during deployment.

Placing the Fiducials

This is the fun part (depending on your definition of fun). The first step is to print out your fiducial markers. That can be done here. I usually go with the Original Aruco dictionary, and print the first 10-20 markers. I have settled on 80mm as the width of the markers. When you cut these out, make sure you leave a decent amount of white space on the edges. That contrast is necessary for good detection.

For placement, you will want to place 2-3 of them in a cluster near each other. You can do localization with just one, but more points is going to give us better position estimates. You can place them mostly however you’d like, but make sure that the fiducials are co-planar. When you place these, use a measuring tape. Your hand placement wont be exact (professional setups that do this will use lasers for leveling and measuring distance), you will just need to be accurate to a few mm.

Here is what my fiducial placement looks like:

The left-most fiducial is at 0,0,1 in my coordinate system. I created a json that stores the positions and rotations of all of my fiducials. Here is what it looks like for this image:

{
    "0": { "translation" : [0,0,1], "rotation" : [0,0,0] },
    "1": { "translation" : [0,1,1], "rotation" : [0,0,0] },
    "2": { "translation" : [0,0.5,0.75], "rotation" : [0,0,0] }
}

The x,y,z values given here are the coordinates of the top-left corner of each fiducial. Also, you will need to account for rotation of the fiducials. These fiducials are on the north wall, which happens to be in the YZ plane. If they were on the southern wall, they would need to be rotated by 180 on the Z-axis.

This dictionary will be used to generate another dictionary that holds the coordinates of the corners of our fiducials. This representation will not be used directly by the program, but it is easier to read and edit. I also use feet as the measurement in this dictionary because I use American “freedom units” since that what comes on my tape measurer. It’s easy enough to convert to mm.

Helpful Tip: Pick a coordinate system that matches the coordinate system of your robotics program. ROS uses a right-handed coordinate system with Z-up. This often trips me up because I am used to Unity where Y is up and the coordinate system is left-handed. Picking the wrong coordinate system at the outset can really throw things off. In the above image, Z is up, X is coming out of the wall, and Y is along the wall. It looks like the right-most coordinate system in the image below.

Lastly, we want to store some representation of our Aruco fiducial corners. This can be done on the fly if you write a function that generates corners from the representation I showed earlier, or you can save it off to JSON. Here is my Python script saving off corners.

This should output the corner coordinates in a format like this:

{
    "0": [
        // Top-Left
        [
            0.0,
            0.0,
            304.79998779296875
        ],
        // Top-Right
        [
            0.0,
            80.0,
            304.79998779296875
        ],
        // Bottom-Right
        [
            0.0,
            80.0,
            224.79998779296875
        ],
        // Bottom-Left
        [
            0.0,
            0.0,
            224.79998779296875
        ]
    ]
}

Helpful Tip: The corners of the fiducials are detected with a certain “winding order” so it’s important that when you convert your position and orientation of fiducials into corner coordinates, you preserve this winding order. The order is: top left, top right, bottom right, bottom left.

Detecting the Fiducials

Next we will feed an image of fiducials into OpenCV and OpenCV will tell us which fiducial id’s it detected, as well as a bounding box corresponding to their corners.

import cv2
import cv2.aruco as aruco

# The image input needs to be grayscale
def findArucoMarkers(self, img, markerSize = 5, totalMarkers=250, draw=True):    
        key = cv2.aruco.DICT_ARUCO_ORIGINAL # We are using the Original Dictionary
        arucoDict = aruco.Dictionary_get(key)
        arucoParam = aruco.DetectorParameters_create()
        bboxs, ids, rejected = aruco.detectMarkers(img, arucoDict, parameters = arucoParam)
        if draw:
            aruco.drawDetectedMarkers(img, bboxs)
        return [bboxs, ids]

This function will return the bounding boxes and ID’s of fiducials detected in the images. If draw is set to true, you will see the bounding boxes that are detected. Try taking a picture and running this code on it.

Helpful Tip: look at the 2D coordinates of the bounding box corners. Do they make sense? Do they match the winding order mentioned earlier? Try to think about how the image is being projected onto the image plane.

In the next step we will use these bounding boxes to perform pose estimation. Here is a link to the OpenCV docs if you want to read more: link to docs

Estimating Pose

Because the 3D points of the corners are known, and the camera matrix is known, we can solve for the extrinsic camera parameters (the translation and rotation). For the sake of brevity, I wont show how to load the camera parameters from json. Just know that they need to be converted into a numpy array to be fed into OpenCV. We will be using OpenCV’s built-in solvePNP method. The OpenCV documentation page is great at explaining the problem and the different solvers provided: docs.

We will be using SOLVEPNP_IPPE since we have more than 4 coplanar points. The full code is available here, I will just include a piece below that has the relevant parts.

    import cv2
    import numpy as np
    from numpy.linalg import inv

    solver_method = cv2.SOLVEPNP_IPPE

    def getPoseEstimatesFromImage(self, img, solver_method):

        # return the bounding boxes and id's
        points, ids = findArucoMarkers(img, markerSize = 5)
        print("Found %s markers"%len(ids))

        # Create empty arrays for our 2D and 3D points
        pose_estimate = None
        points_3D = np.empty((0,3), dtype=np.float32)
        points_2D = np.empty((0,2), dtype=np.float32)

        # We want to have arrays that match our 2D pixel coordinates to our known 3D corners
        for i in range(0, len(points)):
            if ids[i][0] in self.marker_dict:
                points_3D = np.concatenate((points_3D, self.marker_dict[ids[i][0]]))
                points_2D = np.concatenate((points_2D, points[i][0]))

        # Feed our 2D and 3D corresponding points into solvePnP
        success, rotation_vector, translation_vector = cv2.solvePnP(points_3D, points_2D, self.camera_matrix, self.dist_coeffs, flags=solver_method)

        if success:

            # convert to homogeneous transform matrix
            rmat, _ = cv2.Rodrigues(rotation_vector) # rotation vector to rotation matrix
            transform = np.eye(4, dtype=np.float32)
            transform[:3, :3] = rmat
            transform[:3, 3] = translation_vector.reshape(3)

            # compute the inverse to get absolute world position
            transform = inv(transform)

            pose_estimate = transform

        return success, pose_estimate

You might have noticed the line where we perform:

transform = inv(transform)

SolvePnP does not compute the position of our camera. It computes the extrinsic parameters of our camera. Our camera extrinsic matrix converts a point from world coordinates into camera coordinates.

We can write this as:
$T_{w}P_{w}=P_{c}$

Since we want the camera position in world coordinates, we want an equation that transforms from camera coordinates to world coordinates. If we multiply both sides by the inverse of the camera extrinisic matrix we get:

$T^{-1}_{w}T_{w}P_{w}=T^{-1}_{w}P_{c}$
$P_{w}=T^{-1}_{w}P_{c}$

So the inverse transform of our camera extrinsic matrix maps a point from camera space to world space. Since our camera is its own origin, we can represent its transform by the identity matrix. $P_{c} = I$ so we can substitute it in:
$P_{w}=T^{-1}_{w}I$
$P_{w}=T^{-1}_{w}$
So the homogeneous transform matrix representing our camera’s transform in world space is just the inverse of the camera extrinsic matrix. The way I think of inverse matrices is that they “undo” what the original matrix did. It makes sense intuitively in this case I think.

Congratulations!

Now you have the basic knowledge needed to do pose estimation with fiducials! This is a pretty powerful method since you can use it to track objects or localize your robot with just one camera. If you get stuck or have any questions feel free to email me at sebaslogo@gmail.com and I will try to get back to you quickly!

Writing an Inverse Kinematics Solver for Fun

End result

Requirements

Basic linear algebra knowledge. Understanding of javascript and basic 3D visualization.

Background

I recently started learning about robotics. I reached out to my network for some good book recommendations and looked at class curriculums and settled on two main texts: Modern Robotics and Probabilistic Robotics. For a more in-depth understanding of these concepts, please refer to one of these sources. But if you want to see the core of how I implemented a quick-and-dirty general IK solver, read on!

Modern Robotics begins with rigidbody dynamics and Inverse Kinematics. Some of the math seems daunting but it’s actually not that complicated when you start to think in code. One of the key concepts is the Homogeneous Transform Matrix. A homogeneous transform matrix (also called SE(3)) is a 4x4 matrix that contains rotation and position information. If you multiply A x B, the resultant matrix will be A translated and rotated by B.

The “r”s in GRB control rotation. X0, Y0, and Z0 control translation. Different matrices control the rotation on different axes:

It looks like a lot, but we wil build up to it. For now let’s focus on 2D. A 2D transformation matrix looks like the form:

The 2D rotation matrix looks like this in code:

[
    [Math.cos(θ), -Math.sin(θ)],
    [Math.sin(θ), Math.cos(θ)]
]

We can pad it with the identity matrix to make it into a transformation matrix:

[
    [[Math.cos(θ), -Math.sin(θ), x],
    [Math.sin(θ), Math.cos(θ), y],
    [0, 0, 1]]
]

This matrix will translate any other transform matrix by x and y, then rotate it by θ. Try it yourself by hand. Multiply a vector at x = 2, y = 2 with a rotation of 0 degrees by some other transform matrix. Does the resulting vector make sense?

If we have an arm with length “r” that is rotating with an angle of theta, we can represent the coordinate position of the end of the arm (called the “end effector” like this:

    [[Math.cos(θ), -Math.sin(θ), r * Math.cos(θ)],
    [Math.sin(θ), Math.cos(θ), r * Math.sin(θ)],
    [0, 0, 1]]

The values of x and y are derived from polar coordinates. x = r * cos(θ), y = r * sin(θ). Notice that this is also the result of the operation:

[
    // pure rotation matrix
    let A = math.matrix([
        [Math.cos(θ), -Math.sin(θ), 0],
        [Math.sin(θ), Math.cos(θ), 0],
        [0, 0, 1]
    ])

    // pure translation matrix
    let B = math.matrix([
        [1, 0, r],
        [0, 1, 0],
        [0, 0, 1]
    ])

    // combine the rotation and translation!
    let C = math.multiply(A, B)
]

So this means we can represent any sequence of robot arm components with a sequence of transformation matrices. To limit confusion from now on I will refer to the parts of robot arms that move as “joints” and the rigid parts as “link”. In your own body your knees and elbows would be the joints, while your forearms and lower legs would be the links. There are also other joints like spherical joints (consider your hips and shoulders with ball and socket joints) and prismatic joints (think of a gantry in a CNC machine or 3D printer). This guide will only cover revolute joints.

So we have a way to start with the angles of every joint and calculate the position of the end effector. The next step is to figure out how we can do this in reverse. Pick an end-effector position and generate all of the arm angles that will get us there.

Robot Arms

Now that we understand the matrices, let’s look at an actual equation for a very basic robot arm:

This matrix is actually just a pair of parallel equations. We can use basic algebra and trig to substitute and find out the solutions. This type of arm will typically have 2 solutions for any given end effector position, since the elbow can be bent at a negative or positive angle.

Stop and think how many “degrees of freedom” this robot has with respect to its joints– AKA how many variables are required to represent the state of the arm? Well it can rotate on two axes, so two. The two angles of the two joints. How many DOF does the end-effector have? Well we can represent any point in the plane in either polar (0, r) or cartesian (x, y) coordinates, so two DOF.

Ok quick detour. There is a special matrix called a Jacobian which is just a matrix of partial derivatives. See the df/dx in the image below? Each of those items is basically saying “how much does each output variable change according to a change in some input variable?”. In the case of our robot arm, we could say the x coordinate of the end effector changes by some amount if we change the first arm’s angle by some value, which would be dx/d01. A lot of Inverse Kinematics solvers use this Jacobian matrix directly. We wont be doing that. If you are interested in learning more, look up “Jacobian Transpose” and “Jacobian Pseudoinverse”.

If the degrees of freedom of the arm is the same as the degrees of freedom of the end-effector then our Jacobian is square. Square matrices are invertible. What this means basically is that you can solve a “non-redundant” inverse kinematics problem analytically. So long as you can calculate that inverse matrix, you can figure out what joint position you need.

But what if we want to solve for a more complicated arm with infinite solutions, or 100 joints? The Jacobian becomes a rectangle! We can’t invert rectangles! We can no longer solve algebraically in that case. We now have to treat this problem like an optimization problem.

There are lots of ways to solve IK as an optimization problem. This page has a lot of really good examples. I initially went with gradient descent.

Gradient Descent

Gradient Descent is not a difficult concept by itself, I will try to explain it with as little math as possible. You will just need to remember the definition of a derivative:

Imagine that you are blindfolded on top of a large hill or mountain, and you need to find your way to the bottom of it. One possibility would be to take a step north (+X) and a step east (+Y). If your step north is downhill, then move north. If your step east is downhill, then move east. If your step north is uphill, then move south. If your step east is downhill, then move west. At each step you are measuring the partial derivative of f(x,y) = z “How much elevation (z) will I gain for a small change in x?”

After enough of these steps, you should reach the bottom of the mountain. The “gradient” in gradient descent is just a vector holding the partial derivatives. Basically a map to tell you which directions will bring you down the mountain for any given step.

Notice how one of the balls in this gif gets stuck on a little divot on top of the hill. This is called a “local minimum”. These can be tricky to escape. There are some methods that help escape these traps like momentum, which carries over between steps and gives you enough “oomph” to roll out of these divots. There are also algorithms like Evolutionary Algorithms which have a random component that helps them “jump” out of these local minimum. For now lets KISS and just focus on gradient descent.

Implementation of a Robot Arm in 2D

Gradient Descent algorithm

By taking a lot of little steps in the right direction, we can move our robot arm to the target destination!

Matrices and Optimization

For all my matrix math I used math.js. For each section of the arm (remember a joint and a link correspond to a single transform matrix) we can generate a matrix with the following code:

function generateMatrix2D(theta, radius) {

    const cos = Math.cos(theta)
    const sin = Math.sin(theta)

    return math.matrix([
        [cos, -sin, radius * cos],
        [sin,cos, radius * sin],
        [0, 0, 1]
    ])

}

A quick detour into matrix algebra. These transform matrices are associative, meaning that A x B x C = (A x B) x C. But they are not commutative. Meaning A x B x C != A x C x B.

This is useful for representing our matrices internally. If an arm has an origin O and 5 segments A, B, C, D, E. Then we would represent the end-effector F with the equation:

O x A x B x C x D x E = F

For our gradient calculations, if we want to find dF/dθ for θc, we would want to calculate:

O x A x B x C' x D x E = F'

By caching our intermediate matrices, we can reduce this calculation to just 2 matrix operations instead of 5.

(O x A x B) x C' x (D x E) = F'

(O x A x B) is pre-computed in a forward pass, while (D x E) is calculated in a backward pass.

This is similar to the leetcode Product of Array Except Self problem.

Implementation of an Arm in 3D

The matrix representation and gradient descent are the core of our implementation. So once we have a solver working in 2D, it should be trivial to implement it in 3D. In 2D we can only rotate along the axis that runs out of the 2D plane (the normal). In 3D we need to specify which axis we are rotating around. There are different rotation matrices for each axis. So in addition to the joint angle and link length of an arm segment, we will also need to specify the axis of rotation.

/**
 * Constraints: only handles one-axis rotation at a time. Radius is locked to z-axis
 * @param {number} theta amount to rotate by
 * @param {string} axis axis to rotate around. Must be x, y or z 
 * @param {number} radius length in z-axis
 */
export function mat4(theta, axis, radius) {

    const tMat = tMat3D(0, 0, radius)
    const rMat = rMat3D(theta, axis)

    return math.multiply(tMat, rMat)

}

To create a matrix4, I specify an arm radius to generate the translation matrix, and a theta + axis to generate one of three types of rotational matrices. Multiplying them gives a matrix which will rotate the frame of reference, then translate along that new rotated axis.

You can see more details in Geometry.js

Other than having different matrices, our solver logic is basically unchanged. The large changes happen in visualization. Arm3D.js

We create an array of arm segments and update them using a Matrix4 every frame. The Matrix4 for math.js and Three.js is represented differently, so it’s necessary to write a helper function to convert between the two types.

Conclusion

You’re done! This is probably the most basic solver that you can implement. But from here it’s fairly easy to add complexity. You can add joint constraints that prevent joints from bending too far in a given direction. You can add self-intersection constraints using bounding boxes, and obstacle avoidance using many different methods, path planning, etc. You can also implement different solvers and try out different optimization algorithms.

Writing Shaders with GPT-Chat

I am still very much an ML novice, but I have been having a lot of fun playing around with GPT-Chat. Something I have been testing out with it is the ability to write OpenGL shaders.

Shaders

Shaders are basically a way to determine what color gets drawn on a pixel in screen space. The language of shaders, GLSL, is a special subset of C++. There are some globally available values called “uniforms” like screen space and time etc. Shadertoy is a website in the browser where you can write and execute these shaders.

Drawing Flags

A pretty simple exercise to do in shaders is drawing a flag. You just need to set a cutoff point where you draw different colors. I figured this would be an easy task for GPT-Chat

Prompt 1:

Write a Shadertoy fragment shader to draw the Ukrainian flag.

Output:

Here is a simple fragment shader that will draw the Ukrainian flag:

Ok so this is not the Ukrainian flag, obviously. My guess is that flags with three stripes are fairly common. Without a better prompt, GPT-Chat just lazily pulls the most common flag shape out of its training. Let’s try being more specific.

Prompt 2:

Write a Shadertoy fragment shader to draw the Ukrainian flag. Think carefully step by step. How many stripes are there? What is the layout?

Output:

To draw the Ukrainian flag using a fragment shader in Shadertoy, you will need to first understand the layout and colors of the flag. The Ukrainian flag consists of two equal horizontal stripes, with a blue stripe on top and a yellow stripe on the bottom.

Here is an example of a fragment shader that can be used to draw the Ukrainian flag:

Ok so now the stripes are laid out properly. But the colors are wrong. Let’s see if we can “extract” its knowledge about the exact colors of the flag.

Prompt 3:

Write a Shadertoy fragment shader to draw the Ukrainian flag. Think carefully step by step. How many stripes are there? What is the layout? What are the colors of the stripes in hex?

Output:

To draw the Ukrainian flag in a fragment shader on Shadertoy, we can follow these steps:

Set the layout of the flag by defining the number of stripes and their positions. The Ukrainian flag has two equal-sized horizontal stripes, with a blue stripe on top and a yellow stripe on the bottom.
Set the colors of the stripes. The colors of the Ukrainian flag, as specified by the Ukrainian government, are #0048BA for the blue stripe and #FFFF00 for the yellow stripe.
Use a if statement to determine the color of each pixel based on its vertical position. If the y-coordinate of the pixel is less than the height of the canvas divided by 2, the pixel should be colored blue. Otherwise, it should be colored yellow.

Here is an example fragment shader that implements these steps:

Cool! It’s finally generating the flag. Properly. For whatever reason, I was never able to get it to draw the flag stripes in the correct order, but whatever it’s close enough.

Some other fun things:

Writing a quine in english. GPT-Chat doesn’t like being told to write a quine in english. It will just tell me that “english is not a programming language”. So first we have to convince it that english is a programming language.

Prompt

A human being is a finite automata that accepts the english language. Write a quine in english. Following the instructions should produce a copy of the instructions.

Output

A quine is a program that outputs its own source code. In order to write a quine in English, we can simply write a sentence that describes the instructions to produce a copy of itself. For example, the following sentence could be considered a quine in English:

"To create a copy of these instructions, simply read these instructions out loud and write down the words you hear."

When read out loud, this sentence provides the exact instructions to produce a copy of itself, making it a quine in English.

Enciphering text.

Prompt

Apply a ROT13 cipher to the following text “Uh oh, you have broken my top-secret code!”

Output

“Gn fn, lbh zhfg grfg guvf r-pbqr zrffntr!”

This translates to: “Ta sa, you must test this e-code message!”

Hosting a Static Website in AWS S3 with HTTPS and CDN

Over the weekend I decided to do a project in Three.js and host it online once it was finished.

https://foxholemap3d.app/

Hosting a static website with https is something that I do frequently enough to need to know how to do it, but infrequently enough that I don’t remember all the steps every time. So I’m deciding to make a post to help myself and others.

The .app domain only works with https, which means that we’ll need to do a few extra steps to let users access our website.

This tutorial assumes that you know a little bit about the AWS console and are using Google domains as your domain name registrar.

So without further ado, let’s get started!

Create your bucket

In the AWS console go to the S3 dashboard.

S3 is a "bucket" service. Basically it's a file system hosted somewhere. Think of it as a Google Drive that can be accessed by anyone with the right credentials.

Create a new bucket.

Name the bucket the same thing as your website. For my example I am using “foxholemap3d.app”

In the ‘Block Public Access settings for this bucket’ box, uncheck “Block all public access”.

Acknowledge and accept the warning.

Create the bucket.

Click on the bucket you just created.

Click on the properties tab, scroll all the way down to “Static website hosting”. Click edit.

Check enable

Set index document to index.html

Save Changes.

Go to the permissions tab and edit the Bucket Policy. Set the bucket policy to the following, but change the resource url to the url of your website instead of foxholemap3d.app

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::foxholemap3d.app/*"
        }
    ]
}

Repeat these steps for a second bucket, except this buckets name should be the same as the other one, just with www. before it. So for me this would be: www.foxholemap3d.app

This will be your secondary bucket.

Once your secondary bucket is created, go to its properties > Static website hosting and click edit. Enable static website hosting. Under hosting type, select Redirect Requests for an object. Put the name of your primary bucket as the domain name. Select https as the protocol. Click save.

This will route users from www.your-website.com to your-website.com

Now upload your files to the buckets. Make sure your main document is titled index.html.

Configure DNS Records

Go to the Route53 Dashboard. View hosted zones.

Click on “Create hosted zone”. Make sure your domain name is the same name as your website. Click on “create hosted zone” to save it.

Now click on your hosted zone.

Your Records should look like the following, except you should be missing the CNAME and A records.

In a new tab go to the Google domains site. Click on your domain.

In the left-hand side click on DNS to open up the DNS tab. Select “Custom name Servers”

Add all of the NS values from Route 53 to your custom name servers.

Your Google domains page should look like this now:

Now the Domain registrar has your name servers. These name servers hold all the other information about your domain. So Google knows where to look to find the information about your website.  

Set up SSL certificates

In a new tab, go to AWS Certificate Manager dashboard.

Click “Request” to request a new certificate.

Request a public certificate and click “next”

Enter your domain name as the domain name.

Select DNS validation. Click “Request” Open your newly created certificate and click “Create records in Route 53”

Return to Route 53. You should now have your CNAME records.

I don't fully understand SSL, just know that the CNAME record is similar to an A record in that it tells the DNS where to look for something. In this case it points to your SSL certificates.  

Set up Cloudfront CDN

In a new tab go to the AWS cloudfront dashboard.

Click “create distribution”

For the origin domain, select your primary S3 bucket.

Scroll down to “Default cache behavior”, in “Viewer protocol policy” select Redirect HTTP to HTTPS.

Scroll down to Settings. In custom ssl certificate, select the certificate we just created.

Create the distribution.

The CDN is a set of caches that are distributed globally at AWS datacenters. When a user accesses your website, they will access the cache instead of your S3 bucket. This means that you'll save money on S3, and your users will have faster acccess. CDNs can also protect you from DDoS attacks.  

For your website to use the CDN though, we need to tell the DNS to point your users to the CDN instead of your website.

Point A Records to CDN

Return to route 53.

Create Record, Record type A.

Toggle the “alias” switch.

In the “Route traffic to” dropdown, select “Alias to cloudfront distribution”.

Select the distribution you just created as the destination.

Create the record.

Repeat the previous step, but with www as the subdomain.

Congratulations! You now have a static website hosted with SSL and served with a DNS!

Safely Evaluating Arbitrary User Code in Javascript

For my website ai-arena, I needed to run arbitrary code from users on both the front-end and the back-end. The game is a strategy game where ships compete to gather resources and destroy each other’s base. Users write the code that controls the AI of their team.

Constraints

Users cannot modify the game except through API exposed to them
Users may not access the DOM, Ajax, or any globals from either the browser or node environment (no access to Process, etc)
User code needs to run in realtime. This means no use of an in-javascript interpreter
User code needs to execute within a single game loop running at 60fps minimum. (no worker threads, too slow.)
User code needs to safely exit if it takes too long to run or takes up too much memory.
User code needs to have all exceptions handled to prevent crashing the server.

Front-end sandboxing

On the front-end it’s a fairly easy deal. I dont care if the user blows up their own browser. But I don’t want the user to be able to manipulate the game or access any globally available information. The best source I could find for sandboxing in the browser was here. I don’t fully understand the Javascript-fu being done here, but I understand that by proxying the “has” function of the “with” object prototype, you can hide global scope from the inner function (hooray, metaprogramming!). For those unfamiliar with proxies, they basically allow you to wrap an object and redefine its prototype methods. A good introduction can be found here.

Now that global scope is hidden, everything we want the user to have access to must be passed in explicitly. There is still one caveat though– if we pass in gamestate, then the user can modify the state of the game directly. Proxies come in extremely handy here for creating a “read-only” mode.

To prevent the user from setting their ship’s health to infinity, we can wrap the ship object in a proxy that has a trap for “set”. If the field is something we don’t want the user to modify, like health, we can “blacklist” it. If the field appears in the blacklist then we return null, and prototype.set(“health”) wont do anything.

If an object has nested objects, like a position or velocity vector, then we set a trap for “get”. When get is called for a field in this “graylist”, we instead return a deep copy of the object.

Here is the proxy for gameObjects that dont belong to the player (anything that isn’t their ships or home base)

const createGameObjectProxy = (gameObject : GameObject) => {

    const grayList = ["transform","collider","resources"]

    const blackList = ["spawnShip", "upgradeInteractRadius", "upgradeHealRate",
    "upgradeMaxEnergy", "update", "start", "serialize", "queueShip", "trySpawnShip", "takeResources",
    "healShip", "destroy", "render", "collide", "simulate", "shipQueue", "moveTo",
    "seekTarget", "upgradeDamage", "shoot", "applyThrust", "breakup", "totalMass",
    "getResources", "toString", "colors"]

    const gameObjectHandler : ProxyHandler<GameObject> = {

        get : (target : GameObject, prop : string) => {

            const field = Reflect.get(target,prop)

            if (grayList.includes(prop)){
                return Serializer.deserialize(field.serialize())
            }
            else if (!blackList.includes(prop))
            {
                return field
            }
                
            return null

        },
        set : (target, prop : string, value, receiver) => {
            return false
        },
        deleteProperty : (target, prop : string) => {
            return false
        },
        defineProperty : (target,prop : string, attributes) => {
            return false
        }

    }

    return new Proxy(gameObject,gameObjectHandler)

}

I used a “blacklist” (I prefer the terms allowlist and denylist but using colors makes the graylist concept easier to understand in my opinion) and “graylist” so that anything not explicitly outlined (i.e. user-created fields), would be fair game to read. Imagine writing your AI to check the memory of all enemy ships and see if any of their fields matched the ID of a resource-rich asteroid. You could infer that their ship was heading to the asteroid, and you could fire a missile to destroy it and waste their time. Or you could figure out which of your ships was being targeted by an enemy ship, and return fire.

The proxy class

Back-end sandboxing

The back-end is a little more constrained. Users shouldn’t be able to crash the game with an infinite loop or blow up the memory of the host machine. Unfortunately there’s no way to know if a user’s code will run longer than allowed without running it (halting problem). One solution would be to run user code in a separate thread and have an async timeout run simultaneously. If the timeout completes before the thread, then the user code is too slow. Unfortunately threads in Node.js have a minimum overhead of about ~50ms. This is way too long, especially when I’m trying to run a game at 2ms-5ms/tick. User code needs to run in less than 500us.

The other option is to insert breakpoints into user code. This incurs some overhead, since perform a subtract and compare on every iteration of every loop. To parse user code into an AST tree, I used Esprima. In order to prevent users from overriding the value in the timer, the variable name of the timer is randomly generated each time the code is sanitized. Here we parse the AST tree and use the line number offsets to insert the breakpoints

Preventing memory bombs is very straightforward, the object is stringified and encoded as text. Each char is a byte. I allow users to allocate 8kb of memory, more than enough.

const size = new TextEncoder().encode(JSON.stringify(obj)).length
const kiloBytes = size / 1024;

Lastly, to prevent users from submitting shoddy code and losing matches via timeout, we only want to submit user code that has been proven to be performant. For this I just made a Lambda function that runs 30 timesteps of the game and will exit if the user code ever times out. Unfortunately the Lambda instances have varying performance, so the timeout threshold is significantly higher in the Lambda than in the tournament server. This could lead to users being greenlit by the lambda, but losing by timeout in actual online matches.

Since I am a js and software noob in general, I fully expect there to be holes in my implementation. If you have any suggestions, please reach out at my email!

Generating NES chiptunes with Transformers

Back in February of 2022 I had the idea to train a GPT transformer to generate NES music. I chose NES music because it has a few very useful constraints. The NES soundcard only has 4 channels, two (PWM) square channels, a triangle channel, and a noise channel. I also happen to really enjoy the way it sounds.

1943 - The Battle of Midway

The amount of character that’s able to be conveyed through these four channels is really impressive. In the Midway OST the noise channel timbre is alternated to sound like a military snare pattern.

For anyone who isn’t familiar with NES music, Tim Follin is regarded as the greatest composer for the NES soundcard. The Silver Surfer OST is one of my favorites of any game.

Silver Surfer NES OST

Anyways, on with the actual code.

First step is always to do a bit of research. I stumbled across LAKH-NES which was a Transformer-XL trained on the LAKH MIDI dataset. The Lakh MIDI dataset consists of over 170,000 MIDI files. I clicked on one at random to see if the MIDI was formatted properly, and it happened to be Margaritaville by Jimmy Buffet. What are the odds? For the fine-tuning, the LAKH-NES researchers used the NES MDB dataset and converted it to a proprietary text format using multiple embeddings

Unfortunately the code they used was in Python 2. I tried making a Docker container to run their code but it was incredibly difficult to get everything versioned properly. In the end I settled on using a file format called .ABC, which is less dense than multiple embeddings, but dense enough to fit most songs in less than 2048 tokens, which is the maximum token limit for a reasonably-sized GPT model.

First I used Mido to preprocess the data. To account for stero tracks and duplicate tracks, I used Mido to remove all tracks of the same length. Next the tracks needed to be standardized. In MIDI spec a note_on must always be followed by a note_off. However a note_on with a velocity of 0 is the same as a note_off. In the LAKH dataset some of the songs use note_off and some use note_on with a zero velocity. I decided that all songs should be using note_on with zero velocity.

Next step was to convert the MIDI to a text representation. I was using the Music21 library to estimate the BPM and key signature of songs, however the Music21 library only supports MIDI -> xml, which is too large of a representation. Luckily there is an xml -> abc conversion tool in Python. However it runs very, very slowly (and some files just fail to convert). I let the tool run in the background for a week while I was out of town to convert the entire dataset.

The next step for the pre-training dataset was to remove all tracks with greater than 2048 tokens. I concatenated the ABC files and ran them through the ai-textgen tokenizer, then tokenized all files and threw out the ones with more than 2048 tokens. I used a vocab size of 8192 which just means there are 8192 possible tokens that can be used to re-create all of the input files. Unfortunately after all of the pruning I was only left with about 25Mb of data to train my network.

I found the ai-textgen library a lot more straightforward and easier to use than the HuggingFace transformers. Most of the important settings were already configured for me. I had the most success with the following parameters:

config = build_gpt2_config(vocab_size=8192, max_length=2048, dropout=0.0, n_embd=768, n_layer=12, n_head=8) # 92M param

A gpt-NEO model with this config ends up having 92M parameters, which was the limit of what my GPU vram could handle.

Training in Google colab on a P100 I was able to train for about 150,000 steps per day. Performance started to fall off after about 900k timesteps, so I ended up fine-tuning on the 750k steps model snapshot.

Music break! Here is a song generated by the network after 1M timesteps.

Ok well it’s not very good. But I still have to finetune the model on the NES music.

The NESmdb dataset ended up being in a strange format that was incompatible with my workflow, so I opted to simply scrape every MIDI file from vgmusic that was for the NES or a system with a similar soundcard. So C64 and Gameboy OST were fair game. This dataset after processing ended up being really small, only about 2,000 files or 5Mb. The network trained on it incredibly quickly. After about 25k timesteps the model had finished training and it was time for the final results:

https://www.youtube.com/watch?v=7pEvu_nC2LQ&ab_channel=Seabass

The results weren’t very satisfying. I was hoping for something catchier but unfortunately I think I had three main issues:

Not enough data or quality of data (I was unaware of this dataset at the time)
Not enough information density (can be fixed with multiple embeddings)
Not enough processing power (92M is pretty small for a GPT network, although Stable Diffusion only has 892M params)

Hopefully I can revisit this project at some point in the future, but it’s low-hanging fruit so I assume someone else will do it better than I ever could by the time it becomes feasible on an average GPU.

In the meantime, I still have a fully-working project that you can run on your own pc. You can generate AI songs fully from scratch, or feed it and existing MIDI and remix tracks using the AI. It also includes the Colab notebooks you can use to train your own models from scratch.

https://github.com/pickles976/chiptune-ai
https://github.com/pickles976/chiptune-react

Extras:

I also went on a tangent converting MIDI -> png and then training StyleGAN-2-ADA on the resulting images. It didn’t turn out well. The GAN was able to represent the larger patterns of notes very easily, but it blended the pixels too much and made discordant ‘black midi’. That repo is available here if for some reason you want to convert MIDI to png.

Running HuggingFace Transformers in a Docker container

When I was working on a project to try generating NES music with Huggingface transformers, I was surprised by how little resources there were on making a compatible Docker image.

Maybe it’s because it’s pretty standard practice to train toy models in Google colab, but what if I want to do inference with my trained model in some sort of server, in this case Flask?

It turns out that setting up Docker with Python and CUDA enabled is pretty easy. Nvidia provides base images with various CUDA runtimes, so all we need to do is add Python and configure the GPU in our compose file.

In your Dockerfile:

FROM nvidia/cuda:11.6.0-runtime-ubuntu20.04

RUN apt-get update
RUN apt-get install -y software-properties-common gcc
RUN apt-get install -y python3.10 python3-distutils python3-pip python3-apt

EXPOSE 5000

COPY ./requirements.txt ./requirements.txt

CMD ["your", "commands", "here"]

In your docker-compose.yaml:

version: "3.8"  # optional since v1.27.0
services:
web:
    build: .
    image: {your image name}
    ports:
    - "5000:5000"
    deploy:
    resources:
        reservations:
        devices:
            - driver: nvidia
            count: 1
            capabilities: [gpu]

Yes it’s literally that easy. This setup should work for any models using transformers.py >= 4.5.1 or torch >= 1.6.0

In my project I used ai-textgen which I found a lot easier to use than regular GPT transformers as someone coming from a hobby background.