Stanford Course Projects
Here are some fun course projects I led during my Master's studies, along with the reports and posters. Some of these projects may be interesting enough to serve as industry references.
Ensemble Optical Character Recognition (OCR) Systems
This is the course project for Stanford CS229 Machine Learning, and an independent study with Professor Christopher Re. I was the main contributor to this project.
We studied the problem of Optical Character Recognition on domain-specific articles (geoscience papers), and found that multiple OCR systems often make independent errors that can be corrected using each other's output. We implemented an offline machine learning model (an SVM) to predict the correct output when the two OCR systems differ. The model combines two OCR systems that were state of the art at the time, Tesseract and Cuneiform. As error examples, Tesseract would often recognize
m, and Cuneiform would often recognize
c. Our ensemble system chose the correct answer in most cases, achieving 89.80% accuracy on cases where the two OCR systems differ and one of them is correct, a significant accuracy increase over either OCR system alone.
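The first step of such an ensemble, finding where two OCR outputs disagree, can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `find_disagreements` and `merge`, the token lists, and the vocabulary-based `toy_choose` are all hypothetical, and `toy_choose` is only a stand-in for the trained SVM, whose features are not described here.

```python
from difflib import SequenceMatcher


def find_disagreements(out_a, out_b):
    """Align two OCR token sequences and return the spans where they differ."""
    matcher = SequenceMatcher(a=out_a, b=out_b, autojunk=False)
    return [
        (out_a[i1:i2], out_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def merge(out_a, out_b, choose):
    """Merge two OCR outputs.

    Agreeing spans are copied as-is; for each disagreeing span,
    `choose(cand_a, cand_b)` decides which candidate to keep.
    In the project this decision was made by an SVM; here it is a callback.
    """
    matcher = SequenceMatcher(a=out_a, b=out_b, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(out_a[i1:i2])
        else:
            merged.extend(choose(out_a[i1:i2], out_b[j1:j2]))
    return merged


# Hypothetical stand-in for the classifier: prefer the candidate whose
# tokens all appear in a small domain vocabulary (assumption, toy data).
VOCAB = {"magma", "cooled"}


def toy_choose(cand_a, cand_b):
    if all(tok in VOCAB for tok in cand_a):
        return cand_a
    if all(tok in VOCAB for tok in cand_b):
        return cand_b
    return cand_a  # fall back to the first system


tokens_a = ["the", "rnagma", "was", "cooled"]  # made-up output of system A
tokens_b = ["the", "magma", "was", "coolcd"]   # made-up output of system B
merged = merge(tokens_a, tokens_b, toy_choose)
```

Keeping the chooser as a callback separates alignment from the learned decision, so a real classifier could be swapped in without touching the merge logic.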
Capital Crunch: Predicting Investments in Tech Companies
Capital Crunch is our effort to predict investments between investors and technology startups. We used data from CrunchBase and built logistic regression and CRF models on linguistic features and social indicators. We achieved around 80% precision based on our heuristic evaluation methods.
DeepSpeech: A Scalable Decoding System that Integrates Knowledge for Speech Recognition
Authorship Attribution in multi-author documents
Collaborative project with Tim Althoff and Denny Britz.
We pose the novel problem of identifying the authors of scientific publications in a multi-author setting. Initial results show that writing style can be used to predict authors with significant accuracy. This challenges the assumption that simply removing names from a paper submission ensures anonymity in a double-blind review process.
MLB illustrator is a project that visualizes MLB game data as a heterogeneous network, provides baseball statistics, and ranks the batting and pitching ability of players.
Based on this project, I conducted independent research and proposed GameRank, a ranking algorithm for networks with multiple interplaying indicators.
Beijing 3DS Website
Question answering system on Chinese Wikipedia
WordNet viewer featuring force-driven graph of words
2D Shooting Game programmed with Haaf’s Game Engine
Minijava compiler for Android
Worked on optimization; implemented general optimizations based on dataflow analysis.
Kademlia network distributed simulation
Team leader; implemented a UDP-based P2P network using Kademlia DHT.
AI for game “Blokus” generated by Genetic Algorithm
Designed the game AI; used a genetic algorithm (GA) to tune the AI's parameters; ranked in the top 20% of the department.
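As a rough illustration of tuning an AI's numeric parameters with a genetic algorithm, here is a minimal sketch. It is not the Blokus project's code: the `evolve` function, its defaults, and `toy_fitness` are hypothetical. In a game setting the fitness would presumably come from match results; the toy fitness below just rewards closeness to a made-up target vector so that the example is self-contained.

```python
import random


def evolve(fitness, n_params, pop_size=30, generations=40,
           mutation_scale=0.1, elite=2, rng=None):
    """Evolve a list of `n_params` floats that maximizes `fitness`."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    population = [[rng.uniform(-1, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Elitism: carry the best individuals over unchanged.
        next_gen = [list(ind) for ind in ranked[:elite]]
        # Selection: only the better half may reproduce.
        parents = ranked[: pop_size // 2]
        while len(next_gen) < pop_size:
            mom, dad = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one parent at random.
            child = [rng.choice(pair) for pair in zip(mom, dad)]
            # Mutation: add small Gaussian noise to every gene.
            child = [g + rng.gauss(0, mutation_scale) for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


# Toy fitness (assumption): negative squared distance to a made-up target.
TARGET = [0.5, -0.2, 0.8]


def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


best = evolve(toy_fitness, n_params=len(TARGET))
```

Because of elitism, the best fitness in the population never decreases from one generation to the next, which makes the search stable even with a noisy mutation step.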