Joseph Kim


about.txt

Hello! I’m Joseph, a researcher / programmer / student.

I am currently serving as an information security specialist in the Cyber Operational Group of the Republic of Korea Air Force as part of my mandatory military service.

I am interested in educational interactions between man & machine and technologies to improve learning outcomes.

Experience

Oct. 2024 – Present

Information Security Specialist

Republic of Korea Air Force

Monitor external and internal network activity at the headquarters of the Korean Air Force.
Provide timely solutions to security breaches and system vulnerabilities.
Develop automated tools for routine checks of firewall and antivirus software, reducing manual workload.

Aug. 2022 – Jan. 2024

Teaching Assistant & Lab Monitor

Haverford College & Bryn Mawr College

CMSC 105 (Intro to CS), CMSC 231 (Discrete Maths)
Provided individual support during lectures, held weekly office hours, and graded homework.
Resolved technological difficulties and helped debug Python and Java code for 40+ students.

Oct. 2021 – Jan. 2024

Classroom Aide

Phebe Anna Thorne School

Supported the teaching of mathematics and English to 16 kindergarten students.
Facilitated learning and engagement through various activities including sports, arts, and reading.

Education

2021 – 2027

B.S. Computer Science

Haverford College

GPA: 4.0
Took a two-year leave of absence for mandatory military service in Korea.

2024

Study Abroad Semester

Yonsei University

Learning

Mathematics for Machine Learning - Deisenroth, Faisal & Ong

Book Ongoing

Big Data and Education - Ryan Baker

MOOC Complete

AI Engineering - Chip Huyen

Book Complete

Projects

Language Buddy

Learning languages by interacting with AI chatbots.

TypeScript Python React Next.js GraphQL MongoDB Redis Kaggle Vercel PEFT Qwen

View Project View Code

Bible Verse Explorer

Exploring the galaxy of Bible verses in 3D space.

JavaScript Python Flask PostgreSQL Three.js Sentence-Transformer

View Project View Code

AI_CU

Helping students focus on recorded lectures.

Python Tkinter OpenCV MongoDB

View Code

Journal — Bible Verse Explorer

TL;DR

This project turns the entire Bible into a vector space, precomputes semantic cross-references between verses, and lets users explore them visually as a galaxy.

Features to Add Later

Support for multiple languages and translation versions
Better dimensionality reduction techniques to represent the verse embeddings in 3D
Faster query search using ANN algorithms like IVF-PQ or HNSW
Mobile-friendly website

Journal Entry

This project began while I was studying the Bible. Often, when analyzing a verse, I needed to reference other verses to reflect further on its meaning. This is usually done through cross-references, but I thought sentence embeddings could achieve a similar result.

First, I obtained a dataset of the NKJV Bible and cleaned the CSV file to keep only the necessary information. Then, using the sentence-transformers library, I converted each Bible verse into a 384-dimensional embedding. All data for each verse, including the embedding, was initially stored in npz format. The data was later migrated to PostgreSQL with the pgvector extension on Supabase. I used a relational database because my data had a clear structure.

Next came the question of when to compute the similarity between verses. I could calculate it on every search, but running k-nearest neighbors per request would be very slow. Because the database of Bible verses would never be updated, I ended up doing the calculation beforehand and saving the top 50 most similar verses for each verse in the DB.
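A minimal sketch of this kind of precomputation, assuming cosine similarity as the measure and using toy 3-dimensional vectors in place of the real 384-dimensional embeddings. The function names and the sample data are hypothetical, and a real run over about 31,000 verses would more likely use a vectorized library or the database itself.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def precompute_neighbors(embeddings, k):
    """Map each verse id to the ids of its k most similar other verses."""
    neighbors = {}
    for verse_id, vector in embeddings.items():
        scored = (
            (cosine_similarity(vector, other), other_id)
            for other_id, other in embeddings.items()
            if other_id != verse_id
        )
        # nlargest returns the k best (score, id) pairs, best first.
        top = heapq.nlargest(k, scored)
        neighbors[verse_id] = [other_id for _, other_id in top]
    return neighbors


# Toy vectors (made up for illustration, not real embeddings).
toy_embeddings = {
    "John 3:16": [0.9, 0.1, 0.0],
    "1 John 4:8": [0.8, 0.2, 0.1],
    "Genesis 1:1": [0.0, 0.1, 0.9],
    "Psalm 19:1": [0.1, 0.2, 0.8],
}
cross_references = precompute_neighbors(toy_embeddings, k=2)
print(cross_references["John 3:16"])
```

Because the verse set never changes, a table like this only has to be built once, and each lookup afterwards is a plain read.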

Next, I thought it would be useful to compare a user query with the verses. For example, if the user searches for the word "love", the app returns the verses that are most semantically similar to it. This is done with a simple brute-force KNN algorithm. At first, I thought the similarity calculation was the limiting factor, but the bottleneck was actually fetching the 31,000 embeddings. So, the embeddings are now fetched after the first search.
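One way to read this is that the expensive fetch happens lazily, on the first search, and the result is reused afterwards. The sketch below illustrates that pattern together with a brute-force search. The class, the `load_embeddings` callable, and the toy vectors are all hypothetical, and the query is passed in as a ready-made vector rather than being embedded by a model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class VerseSearch:
    """Brute-force nearest-neighbor search with lazily loaded embeddings."""

    def __init__(self, load_embeddings):
        # load_embeddings is a hypothetical callable returning {verse_id: vector};
        # in a real app it would wrap the slow database fetch.
        self._load_embeddings = load_embeddings
        self._embeddings = None

    def search(self, query_vector, k):
        if self._embeddings is None:
            # Only the first search pays the cost of fetching everything.
            self._embeddings = self._load_embeddings()
        scored = [(cosine(query_vector, v), vid) for vid, v in self._embeddings.items()]
        scored.sort(reverse=True)
        return [vid for _, vid in scored[:k]]


fetch_calls = []


def fake_fetch():
    """Stand-in for the database fetch; records how often it is called."""
    fetch_calls.append(1)
    return {
        "John 3:16": [0.9, 0.1, 0.0],
        "Genesis 1:1": [0.0, 0.1, 0.9],
        "Psalm 19:1": [0.1, 0.2, 0.8],
    }


search = VerseSearch(fake_fetch)
first = search.search([1.0, 0.0, 0.0], k=1)
second = search.search([0.0, 0.0, 1.0], k=2)
print(first, second, len(fetch_calls))
```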

After deciding the structure of the project, I had to think about the best way to present the results in the frontend. Initially, it was a regular form with dropdown menus to choose a bible verse and a text box to input a query. But this looked way too dull.

Instead, I decided to represent each verse as a star, with all the verses representing a galaxy or universe of verses. This fit neatly with the main purpose of the site, which was to allow users to "explore" the verses.

Because this required reducing the embeddings from 384 dimensions to 3, I began looking into dimensionality reduction techniques. To be honest, I am still not very familiar with them, but UMAP seemed to preserve the overall relationships between embeddings better than the others.

Because I wanted to get a working site up as soon as possible, I didn't get to test various UMAP parameters. This, along with the inherent difficulty of reducing the dimension more than a hundredfold, led to a representation that is not always accurate. If you search for different verses or queries, you will see that some of the results are close to each other, but many are far apart, showing the limitation of the dimensionality reduction technique.
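The limitation described here could be quantified by checking how many of each point's nearest neighbors survive the reduction. The sketch below is one simple way to do that, not something the project currently does. It assumes Euclidean distance for simplicity, and the function names and toy points are hypothetical.

```python
import math


def nearest_ids(points, target_id, k):
    """Ids of the k points closest to target_id by Euclidean distance."""
    target = points[target_id]
    others = [
        (math.dist(target, p), pid) for pid, p in points.items() if pid != target_id
    ]
    others.sort()
    return {pid for _, pid in others[:k]}


def neighbor_overlap(high_dim, low_dim, k):
    """Average fraction of each point's k nearest neighbors kept after reduction."""
    total = 0.0
    for pid in high_dim:
        kept = nearest_ids(high_dim, pid, k) & nearest_ids(low_dim, pid, k)
        total += len(kept) / k
    return total / len(high_dim)


# Toy data: 4-D "embeddings" and two candidate 2-D layouts (all made up).
high = {"a": [0, 0, 0, 0], "b": [1, 0, 0, 0], "c": [0, 0, 5, 0], "d": [0, 0, 5, 1]}
faithful_layout = {"a": [0, 0], "b": [1, 0], "c": [0, 5], "d": [0.5, 5]}
scrambled_layout = {"a": [0, 0], "b": [9, 9], "c": [0, 1], "d": [9, 8]}

print(neighbor_overlap(high, faithful_layout, k=1))
print(neighbor_overlap(high, scrambled_layout, k=1))
```

A score near 1 means neighbors in the original space stay neighbors in the reduced layout. Comparing this score across parameter settings would be one way to tune the reduction.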

The website was originally deployed on Google Cloud. I have since switched to Modal, and the site will hopefully stay alive as long as Supabase and Modal do not change their free tiers.

Journal — Language Buddy

TL;DR

An AI language-learning app built with the Next.js framework that provides a conversational partner for personalized learning. A small fine-tuned multilingual LLM is hosted with optimization techniques such as quantization and adapter switching.

Features to Add Later

Increasing performance with various prompting methods
RAG to obtain relevant information from the entire chat history
Online chatting with other users and group chat

Journal Entry

[Application]

This project came to mind while I was studying Spanish on my own. I was studying the language through a textbook, along with YouTube videos and podcasts for listening practice. However, because I was studying alone without a partner, I didn't have the opportunity to practice the language in a natural setting. So, why not build an AI conversational partner that can help me practice Spanish more naturally?

I wanted to try a full-stack framework, so I decided on Next.js with TypeScript. I also used GraphQL for my API, as I thought its schemas would have synergy with type safety. I used MongoDB this time, because data would keep getting added and I wanted to stay flexible about what would be included in the database. With the big picture decided, I created a to-do list with detailed implementations, features to include, etc. A claude.md file was also written for an easier AI workflow.

First came the basic components of the chat app. These included a chat interface, message bubbles, date dividers, and side navigation. Then TTS and STT features were added for listening and speaking practice. At the moment, I use the Web Speech API, so the quality of the TTS is quite bad. I will probably work on using a better API later on.

After I had the chat app, the next step was deciding what information to collect from the user to provide a personalized learning experience. This included a brief introduction, language level, correction style, learning goals, and interests. (Correction style was later removed, as the finetuned model was not following the correction instructions well.)

[AI Engineering]

Then, it was time to choose the model. My original model was Llama 3.2 1B Instruct. It was multilingual and small enough to be deployed sustainably.

After choosing the model, I tested it with a series of conversations that a user might have. The responses were too verbose and unnatural even after prompt engineering, so I decided to finetune the model. At first, I wanted to finetune the model on specific characters. For example, a user could chat with Don Quixote to learn Spanish, or a K-Drama character to learn Korean. But when I began searching for training data, I realized it would be impossible to get a large amount of quality data for it.

So, instead, I decided to finetune the model per language with LoRA. It would be done with 3 languages: Korean, English, and Spanish. I chose English and Korean because I could check the quality of the responses. I chose Spanish because I wanted to use the app myself to learn the language.

I still struggled to find multi-turn conversational data that could help with my finetuning. So, after digging through Kaggle and Hugging Face datasets with no satisfactory results, I decided to use synthetic data. I spent some time thinking of different variables to make the data more diverse. One was a list of topics that users would generally be interested in, like travel or food. Other variables included conversation length (no. of messages), conversation position, and CEFR level. I also experimented with other variables like politeness and correction style, but decided against them in the end.
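A rough sketch of how such variables could be combined into generation prompts. Everything concrete here is an assumption made for illustration: the prompt wording, the message-count range, and the reading of "conversation position" as where an excerpt sits in a longer chat. Only the variable names and the two example topics come from the description above.

```python
import random

TOPICS = ["travel", "food"]  # the two topics named above; a real list would be longer
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
POSITIONS = ["beginning", "middle", "end"]  # assumed meaning of "conversation position"
LANGUAGES = ["Korean", "English", "Spanish"]


def sample_spec(language, rng):
    """Draw one random combination of the diversity variables."""
    return {
        "language": language,
        "topic": rng.choice(TOPICS),
        "cefr_level": rng.choice(CEFR_LEVELS),
        "num_messages": rng.randrange(4, 13, 2),  # assumed range: 4 to 12, even
        "position": rng.choice(POSITIONS),
    }


def build_prompt(spec):
    """Turn a spec into a (hypothetical) instruction for the data-generating model."""
    return (
        f"Write a {spec['num_messages']}-message conversation in {spec['language']} "
        f"between a language learner at CEFR level {spec['cefr_level']} "
        f"and a friendly partner. Topic: {spec['topic']}. "
        f"The excerpt comes from the {spec['position']} of a longer chat."
    )


rng = random.Random(0)  # fixed seed so the sampled specs are reproducible
specs = [sample_spec(lang, rng) for lang in LANGUAGES for _ in range(3)]
print(build_prompt(specs[0]))
```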

After generating 2,000 conversations per language, I used the peft library in a Kaggle notebook to finetune the model. (Free GPUs!) Training loss, validation loss, perplexity, and LoRA weights all showed that the model was learning well. I chose a Hugging Face Space to host the finetuned model, as it offered 16GB of RAM for free. Once I set up the APIs for token streaming, adapter switching, and a simple Gradio UI, I began testing the model.
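For reference, perplexity is the exponential of the mean per-token cross-entropy loss (in nats), so it tracks the validation loss directly. A small sketch with made-up loss values:

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss, measured in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))


# Made-up example: a model that is as uncertain as a uniform choice among
# 4 tokens has a loss of ln(4) per token, which gives a perplexity of 4.
example_losses = [math.log(4)] * 3
example_ppl = perplexity(example_losses)
print(example_ppl)
```

A falling perplexity on held-out conversations says the model is getting better at predicting the data, though it does not by itself show that responses are natural or correct.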

Then I ran into two issues.
1. Despite a minimal system prompt and a 1B-parameter model, inference was extremely slow, taking 5-10 minutes to generate a simple response.
2. The performance was just bad. It was subpar in English, and the model struggled to understand the language or form coherent sentences in Spanish and Korean. I initially thought something was wrong with the finetuning, but the problem persisted with the base model as well. (In hindsight, I should have tested the performance before finetuning...)

To solve the first issue, I looked for ways to optimize inference on CPU. I quantized the model to Q4_K_M and tried the GGUF format with the llama-cpp library, but it was still too slow. After many tries, I decided to move on to a hosting service that provided a GPU and more RAM.
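To give a feel for what 4-bit quantization does, here is a deliberately simplified sketch: each block of weights is stored as small integers plus a single scale. This is not the Q4_K_M scheme itself, which is more elaborate. The function names and sample weights are hypothetical.

```python
def quantize_block(weights):
    """Quantize one block of floats to signed 4-bit integers plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7, the top of the signed 4-bit range [-8, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in quants]


block = [0.7, -0.35, 0.1, 0.0]  # made-up weights
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
print(quants, restored)
```

Each weight now needs 4 bits instead of 16 or 32, at the cost of a rounding error of at most half a scale step per weight.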

Because I was no longer constrained by the resources of a Hugging Face Space, I searched for a better model. I ended up with Qwen2.5-7B-Instruct. This base model was initially giving Chinese responses to Korean prompts, and I thought about adding filters for each language. Fortunately, this issue was fixed by finetuning.
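The filter idea was never needed in the end, but a hypothetical version for the Korean case could compare how many characters fall in the Hangul and Han Unicode ranges. The function names and the 20% threshold below are assumptions for illustration.

```python
def script_counts(text):
    """Count Hangul and Han (Chinese) characters by Unicode code point range."""
    counts = {"hangul": 0, "han": 0}
    for ch in text:
        code = ord(ch)
        if (
            0xAC00 <= code <= 0xD7A3      # Hangul syllables
            or 0x1100 <= code <= 0x11FF   # Hangul jamo
            or 0x3130 <= code <= 0x318F   # Hangul compatibility jamo
        ):
            counts["hangul"] += 1
        elif 0x4E00 <= code <= 0x9FFF:    # CJK unified ideographs
            counts["han"] += 1
    return counts


def looks_chinese(text, max_han_ratio=0.2):
    """Flag a reply whose Han share exceeds the (assumed) threshold."""
    counts = script_counts(text)
    total = counts["hangul"] + counts["han"]
    if total == 0:
        return False
    return counts["han"] / total > max_han_ratio


print(looks_chinese("안녕하세요, 만나서 반갑습니다"))
print(looks_chinese("你好，很高兴认识你"))
```

A ratio is used rather than a strict ban because Korean text can legitimately contain a few Han characters.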

Finally, I designed a few different personalities for the chatbot to give learners a more diverse experience. These included a curious personality, a playful one, and more. Currently, they are implemented through prompt engineering, but this may change to finetuning later on.

[Return to Application]

After that, it was a matter of adding the final touches, checking for errors, and deploying. Authentication was done through Auth.js with an option for Google credentials. Because my GraphQL subscriptions required Pub/Sub, Redis was used. This setup would also be useful later for online group chats.

Currently, the model is hosted without a container and cold starts on requests, leading to slow initial responses. Of course, both the performance and the speed would improve by using SOTA model APIs, but I wanted to have full control of all the parts of this project.

As of March 2026, some new features have been added. An "improve my sentence" button helps learners check their mistakes and improve upon them. A translate button has also been added to bot messages to help users understand phrases that they do not know.


Journal — AI_CU

TL;DR

A Python program to help students pay attention to video lectures by detecting face presence and closed eyes with the OpenCV library.

Features to Add Later

Migration to web
In-depth analysis of student behavior to discover parts of lectures that are not engaging

Journal Entry

This project was part of an undergraduate intern research project in the AIEd Lab. With many courses turning remote, we knew that students often fail to engage with the material in recorded lectures; many simply let the video play while doing various off-task activities. Therefore, we wanted to find a way to help students focus without being overly intrusive.

Many approaches were discussed to measure student attention, but in order to keep the project simple, we opted for face presence and open eyes as markers of a student paying attention to the lecture. The lecture would stop if a face was not detected or the eyes were closed for a certain period of time.
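The pause rule can be sketched as a small state machine that is fed one detection result at a time. The class name, the 3-second grace period, and the sample readings are hypothetical, and the face and eye detection itself (done with OpenCV in the project) is left out.

```python
class AttentionMonitor:
    """Decide when to pause a lecture based on per-frame detection results."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self._inattentive_since = None  # when the current lapse started

    def update(self, timestamp, face_detected, eyes_open):
        """Return True if the lecture should be paused at this timestamp."""
        if face_detected and eyes_open:
            self._inattentive_since = None  # attention regained, reset the timer
            return False
        if self._inattentive_since is None:
            self._inattentive_since = timestamp
        return timestamp - self._inattentive_since >= self.grace_seconds


monitor = AttentionMonitor(grace_seconds=3.0)  # assumed threshold
readings = [
    (0.0, True, True),    # attentive
    (1.0, True, False),   # eyes just closed
    (2.0, True, False),
    (4.0, True, False),   # closed for 3 seconds
    (5.0, True, True),    # attentive again
    (6.0, False, False),  # face just left the frame
]
decisions = [monitor.update(t, face, eyes) for t, face, eyes in readings]
print(decisions)
```

The grace period keeps a blink or a brief glance away from stopping the video.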

After developing a prototype, we ran tests on lab computers. Realizing that the program would not run smoothly on many of the computers, we switched to multithreading so that the video and the face detection would run in parallel.
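A minimal sketch of that arrangement, assuming a worker thread that receives frames through a bounded queue so that playback never waits on detection. The function names are hypothetical, integers stand in for video frames, and `detect` stands in for the real face and eye detector.

```python
import queue
import threading


def run_detection(frames, detect, results):
    """Worker: process frames until a None sentinel arrives."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.append(detect(frame))


def play_with_detection(video_frames, detect):
    """Show every frame while a worker thread analyzes as many as it can."""
    frames = queue.Queue(maxsize=8)
    results = []
    worker = threading.Thread(target=run_detection, args=(frames, detect, results))
    worker.start()
    shown = 0
    for frame in video_frames:
        shown += 1  # stand-in for displaying the frame
        try:
            frames.put_nowait(frame)
        except queue.Full:
            pass  # detector is behind: skip this frame rather than stall playback
    frames.put(None)  # tell the worker to stop
    worker.join()
    return shown, results


# Integers stand in for frames; the lambda is a placeholder detector.
shown, results = play_with_detection(range(100), lambda frame: frame % 2 == 0)
print(shown, len(results))
```

Dropping frames when the queue is full trades detection coverage for smooth playback, which suits slower machines.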

Unfortunately, I had to join the military service before the experiments actually took place.