Stanford Course Projects

Here are some fun course projects I led during my Master's studies, along with their reports and posters. Some of these projects may be interesting enough to serve as a reference for industry work.

Ensemble Optical Character Recognition (OCR) Systems

This was the course project for Stanford CS229 Machine Learning and an independent study with Professor Christopher Re. I was the main contributor to this project.

We studied Optical Character Recognition on domain-specific articles (geoscience papers) and found that different OCR systems often make independent errors, so one system's output can correct the other's. We implemented an offline machine learning model (an SVM) that predicts the correct output when two OCR systems disagree. It combines Tesseract and Cuneiform, two state-of-the-art OCR systems at the time. As examples of typical errors, Tesseract would often recognize rn as m, and Cuneiform would often recognize e as c. Our ensemble system chose the correct answer in most cases: it achieved 89.80% accuracy on cases where the two OCR systems differ and one of them is correct, a significant accuracy improvement over either OCR system alone.
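As an illustration of the first step such an ensemble needs, the sketch below finds the spans where two OCR outputs disagree. It is not the project's code: the function name `disagreements`, the word-level alignment with Python's `difflib`, and the sample strings are all assumptions made for this example. The classifier that picks between the two candidates is not shown.

```python
from difflib import SequenceMatcher


def disagreements(out_a: str, out_b: str) -> list[tuple[str, str]]:
    """Return (text_a, text_b) pairs for spans where two OCR outputs differ.

    Hypothetical helper: aligns the outputs word by word and keeps every
    non-matching span as a candidate for a downstream classifier.
    """
    words_a, words_b = out_a.split(), out_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return spans


# Made-up outputs for the true text "the modern rock layer":
# the first shows an rn -> m error, the second an e -> c error.
tesseract_like = "the modem rock layer"
cuneiform_like = "thc modern rock layer"
print(disagreements(tesseract_like, cuneiform_like))
# [('the modem', 'thc modern')]
```

Adjacent differing words are merged into one span, so a classifier built on top of this would need to split the span or score each candidate as a whole.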

Report: [PDF]

Capital Crunch: Predicting Investments in Tech Companies

Capital Crunch is our effort to predict investments between investors and technology startups. We used data from CrunchBase and built Logistic Regression and CRF models with linguistic features and social indicators. We achieved around 80% precision under our heuristic evaluation methods.

Report: [PDF]

Poster: [PDF]

Code: [GitHub]


Kaleidoscope is an interactive ideation system that assists people in brainstorming by automatically suggesting new ideas. It records the user's ideas in the system's back-end idea networks and makes suggestions that are relevant, diverse, and inspiring.

Report: [PDF]

Poster: [PDF]

DeepSpeech: A Scalable Decoding System that Integrates Knowledge for Speech Recognition

DeepSpeech is a project that uses DeepDive to decode the word lattice in speech recognition. It can integrate various features and performs probabilistic inference to choose the best path of words to output.

Report: [PDF]

Poster: [PDF]

Code: [GitHub]

Authorship Attribution in multi-author documents

Collaborative project with Tim Althoff and Denny Britz.

We introduce the novel problem of identifying the authors of scientific publications in a multi-author setting. Initial results show that writing style can be used to predict authors with significant accuracy. This challenges the assumption that simply removing names from a paper submission ensures anonymity in a double-blind review process.

Report: [PDF]

Slides: [PDF]

Previous Projects

MLB illustrator

MLB illustrator is a project that visualizes MLB game data as a heterogeneous network, provides baseball statistics, and ranks the batting and pitching ability of players.

Based on this project, I conducted independent research and proposed GameRank, a ranking algorithm for networks with multiple interplaying indicators.

Beijing 3DS Website

Beijing 3DS is an international startup competition held in Beijing.

I was the back-end designer of the event website. I set up the NGINX server and built the application form handler, which successfully processed all the applications for the event.

Question answering system on Chinese Wikipedia

Team leader; designed QA algorithm using Chinese NLP techniques.

WordNet viewer featuring force-driven graph of words

Designed the dynamic relationship graph with an original force-driven layout algorithm.

2D Shooting Game programmed with Haaf’s Game Engine

Sole developer; used C++ OO programming; designed barrages with a force model.

Minijava compiler for Android

Worked on optimization; implemented general optimizations based on dataflow analysis.

Kademlia network distributed simulation

Team leader; implemented a UDP-based P2P network using Kademlia DHT.
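For readers unfamiliar with Kademlia, its routing rests on an XOR distance between node IDs. The sketch below shows that metric only; it is not the project's code, and the function names, the 4-bit IDs, and the value of `k` are assumptions chosen for illustration (the UDP messaging layer is not shown).

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance between two IDs: their bitwise XOR."""
    return a ^ b


def bucket_index(own_id: int, other_id: int) -> int:
    """Index of the k-bucket for other_id: position of the highest differing bit.

    Assumes own_id != other_id (a node does not store itself).
    """
    return xor_distance(own_id, other_id).bit_length() - 1


def closest_nodes(target: int, known_ids: list[int], k: int = 2) -> list[int]:
    """Return the k known IDs closest to target under the XOR metric."""
    return sorted(known_ids, key=lambda node_id: xor_distance(node_id, target))[:k]


# Toy 4-bit IDs (real Kademlia IDs are 160 bits).
known = [0b0001, 0b1000, 0b1011, 0b1110]
print(closest_nodes(0b1010, known))   # [11, 8]
print(bucket_index(0b1010, 0b1011))   # 0
```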

AI for game “Blokus” generated by Genetic Algorithm

Designed the game AI; used a GA to tune the AI's parameters; ranked in the top 20% of the department.
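As a rough illustration of tuning parameters with a genetic algorithm, here is a minimal sketch. It is not the project's code: the function `evolve`, its defaults, and the toy fitness function are assumptions. In a game setting, the fitness would presumably come from playing matches with the candidate parameters.

```python
import random


def evolve(fitness, n_params, pop_size=20, generations=30,
           mutation_scale=0.1, seed=0):
    """Tune a list of real-valued parameters with a simple genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation on each gene.
            child = [rng.choice(pair) + rng.gauss(0, mutation_scale)
                     for pair in zip(mom, dad)]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy fitness (hypothetical): negative squared distance to a target vector.
TARGET = [0.5, -0.25]


def toy_fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


best = evolve(toy_fitness, n_params=2)
print(best, toy_fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.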