Artificial Intelligence

By Gabriel Mukobi

Machine learning, deep learning, reinforcement learning, and other AI projects with a focus on AI security, including both technical research projects and community resources to help others learn.

CAISI

Since 2024, my AI research has been based at the Center for AI Standards and Innovation (CAISI) within NIST, which serves as the U.S. Government's industry point of contact for AI testing, evaluations, collaborative research, and standards work related to the security and measurement of advanced AI systems.

Responsible for: Supporting CAISI's work on AI evaluation, standards, agent security research, and national security partnerships.

Societal Adaptation to Advanced AI

A paper describing a complementary AI governance strategy focused on societal adaptation: reducing the negative impacts of a given level of advanced AI through adaptive interventions that help society avoid, defend against, and remedy harmful uses.

Responsible for: Co-authoring the paper.

Escalation Risks from Language Models

A wargame-based evaluation framework for studying whether large language model agents take escalatory actions in military and diplomatic scenarios, including arms-race dynamics and, in rare cases, nuclear escalation.

Responsible for: Co-authoring the paper.

Welfare Diplomacy

A benchmark for language model cooperation built around a general-sum variant of Diplomacy where agents must trade off military conquest against domestic welfare, enabling clearer evaluation of cooperative capabilities and exploitability in multi-agent settings.

Responsible for: Leading the project and co-authoring the paper.

SuperHF

An LLM post-training research project developing alternatives to reinforcement learning from human feedback (RLHF) that use supervised learning instead of PPO-based RL, preserving general capabilities while achieving better downstream human preference performance.

Responsible for: Co-leading the branch of the project focused on developing Supervised Iterative Learning from Human Feedback (SuperHF).
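The general shape of this kind of supervised alternative to RLHF can be sketched as an iterative filtered fine-tuning loop: sample several completions per prompt, keep the best-scoring one, and fine-tune on the filtered set with a plain supervised update. This is a minimal sketch, not the exact SuperHF algorithm; the `sample`, `score`, and `finetune` callables are hypothetical stand-ins for a real LLM stack.

```python
import random

def superhf_style_loop(model, prompts, sample, score, finetune,
                       n_samples=4, iterations=3):
    """Hypothetical sketch of iterative supervised learning from feedback:
    sample several completions per prompt, keep the highest-scoring one,
    and fine-tune on the filtered pairs with ordinary supervised learning
    (no PPO-style RL update)."""
    for _ in range(iterations):
        filtered = []
        for prompt in prompts:
            completions = [sample(model, prompt) for _ in range(n_samples)]
            best = max(completions, key=lambda c: score(prompt, c))
            filtered.append((prompt, best))
        model = finetune(model, filtered)  # plain supervised update
    return model
```

The key design point this illustrates is that the feedback signal only selects training data; the model update itself stays a standard supervised step.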

Scaffold

Simulated Comments from the Alignment Forum For Original Literature Development (SCAFFOLD) is a writing tool that generates LLM comments similar to human comments from an online AI research forum, built by fine-tuning a 6B-parameter LLM on those comments.

Responsible for: Everything except the data collection scripts.

Automated Sandwiching

An experimental framework for automating sandwiching evaluations of scalable oversight techniques by having language models talk to each other. We won 1st place in all categories at Apart Research's Scale Oversight Hackathon.

Responsible for: Generating the initial research question and contributing roughly half of the theoretical development, the Python implementation, and the paper writeup.

Backup Transformer Heads are Robust to Ablation Distribution

Some quick mechanistic interpretability research into backup name mover heads for indirect object identification (IOI) in GPT-2 small, which won 2nd place in Apart Research's second Interpretability Hackathon.

Responsible for: Deciding on the research question, engineering and running the mean ablation experiments, evaluating and interpreting results.
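Mean ablation, the core intervention in these experiments, replaces a component's output with its average output over a dataset rather than zeroing it out. A minimal NumPy sketch of the idea, with an illustrative array layout rather than GPT-2's actual internals:

```python
import numpy as np

def mean_ablate_head(head_outputs, head_index):
    """Mean-ablate one attention head: replace that head's output on every
    example with its average output over the dataset, leaving all other
    heads untouched. head_outputs has shape (n_examples, n_heads, d_head)."""
    ablated = head_outputs.copy()
    mean_out = head_outputs[:, head_index].mean(axis=0)  # (d_head,)
    ablated[:, head_index] = mean_out  # broadcasts over examples
    return ablated
```

Comparing model behavior before and after such an ablation is what reveals whether "backup" heads compensate for the ablated one.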

MLAB Transformers From Scratch

A documented and unit-tested repo to help you learn how to build transformer neural network models from scratch.

Responsible for: Taking the transformer days from the original MLAB repo, creating a clean starter file with class/method stubs and clear docstrings, adding unit tests, and implementing a clean solution file.
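The kind of from-scratch building block this repo walks through can be illustrated with scaled dot-product attention in plain NumPy. This sketch shows the standard operation softmax(QKᵀ/√d_k)V; the function name and shapes are illustrative, not the repo's actual API.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Core transformer operation: softmax(Q K^T / sqrt(d_k)) V.
    q, k have shape (seq_len, d_k); v has shape (seq_len, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # weighted mix of values
```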

Concept-Based Explanations

A new technique building on prior work for unsupervised learning of human-interpretable concept-based explanations of language models on the task of sentiment analysis. Performance is comparable to black-box baseline models, though the coherence of the discovered concepts is mixed.

Responsible for: Almost all of the code, most of the paper.

Levelling Up in AIS RE

A level-based guide for independently up-skilling in AI Research Engineering that aims to give concrete objectives, goals, and resources to help anyone go from zero to hero.

Responsible for: Everything, though it draws upon knowledge from others listed in the Sources section.

LLM Multiplication Tables

Evaluating the 2-digit multiplication abilities of various language models using 🤗Transformers and the OpenAI API, for fun and to possibly inspire some mechanistic interpretability research. By graphing metrics more nuanced than accuracy, such as the number of digits that differ from the correct answer, we can see clearer patterns in emergent multiplication capabilities.

Responsible for: Everything.
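The digit-difference metric mentioned above can be sketched in a few lines: compare zero-padded decimal strings position by position. The function name and padding width are illustrative choices, not the project's exact implementation.

```python
def digit_difference(predicted, target, width=4):
    """Number of digit positions where the prediction differs from the true
    answer, comparing zero-padded decimal strings (width 4 covers every
    2-digit-by-2-digit product). A more graded signal than 0/1 accuracy."""
    p = str(predicted).zfill(width)
    t = str(target).zfill(width)
    return sum(a != b for a, b in zip(p, t))
```

For example, a model that answers 1804 for 56 × 34 (= 1904) is wrong by one digit rather than simply "wrong," which makes partial competence visible in a plot.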

Minitorch Self-Study Guide

While implementing Minitorch, a Python reimplementation of the core functionality of the popular PyTorch machine learning library, I gained a better understanding of how autodifferentiation, tensors, and other PyTorch internals work, and I created this study guide to help others learn as well.

Responsible for: All of the study guide, none of Minitorch.
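The heart of what Minitorch implements is reverse-mode autodifferentiation. A minimal scalar version conveys the idea; this class is an illustrative sketch in the spirit of Minitorch and PyTorch's autograd, not Minitorch's actual API.

```python
class Value:
    """Minimal scalar reverse-mode autodiff node: each node records its
    parents and the local gradient of its output with respect to each."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # (parent_node, local_gradient) pairs

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in
        # reverse so each node's gradient is complete before it propagates.
        order, visited = [], set()
        def visit(node):
            if id(node) not in visited:
                visited.add(id(node))
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node._parents:
                parent.grad += local_grad * node.grad
```

Calling `backward()` on the final output fills in `grad` on every input, which is exactly what `loss.backward()` does over tensors in PyTorch.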

UE4ML

A plugin that facilitates deep reinforcement learning in Unreal Engine by exposing UE4 as an OpenAI Gym environment, along with a suite of deep RL sample projects for testing the plugin.

Responsible for: Cleaning up and improving the plugin for release and creating all the sample projects, while working for Epic Games.
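What "exposing UE4 as an OpenAI Gym environment" means in practice is implementing the classic Gym contract: `reset()` returns an initial observation, and `step(action)` returns an (observation, reward, done, info) tuple. A toy stand-in environment shows the shape of that contract without depending on the `gym` package or the plugin's real code:

```python
class ToyGridEnv:
    """Toy stand-in for a UE4-backed environment, following the classic
    OpenAI Gym contract. The real plugin implements this same interface
    on top of a running UE4 game."""
    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Advance one timestep; action +1 moves right, -1 moves left.
        Returns (observation, reward, done, info)."""
        self.position += action
        done = self.position >= self.goal
        reward = 1.0 if done else -0.01  # small step penalty, goal bonus
        return self.position, reward, done, {}
```

Because any RL library that speaks this interface can drive the environment, wrapping UE4 this way lets standard deep RL agents train inside Unreal Engine games unchanged.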

CNNs for CGI Detection

A binary CNN classifier to distinguish between real photographic images and photorealistic computer-generated images that achieves 96% test accuracy on a custom dataset.

Responsible for: Approximately half of the work.