During my master's studies at Stanford, I have the honor of working with Professor Christopher Ré and being inspired by other great minds in the InfoLab. I work on projects in Knowledge Base Construction (KBC), the process of populating relational databases with information extracted from data. I develop DeepDive, a powerful data management tool for KBC. Through these experiences, I have discovered how systematic research methods boost industrial applications, and how engineering tasks facilitate research projects.
I will be a research intern in the Knowledge Base Group of Toshiba Corporation from January 2015 to March 2015.
Projects at Stanford
Contributed to DeepDive's feature extraction pipelines. DeepDive features a pipeline that enables flexible parallel feature extraction. This pipeline is vitally important and widely used in all our applications, but it was unsatisfactorily slow. I investigated the problem and found that the parallel task scheduler, the data loader, and the data unloader were the bottlenecks. I implemented two faster code paths to solve this issue. The first uses the system tool "xargs" to manage parallelism, with optimized loading and dumping of the database. The second compiles the user's extraction script into a database procedural language, which further reduces disk I/O by running the UDF inside the database. These implementations turned out to be 10x to 20x faster than the original one, and unblocked many large-scale research projects.
Ported DeepDive from PostgreSQL to MySQL, MariaDB, and MySQL Cluster to extend its usability. I also refactored the code base for easier integration with other DBMSs. One challenge was that our newly supported distributed DBMS, MySQL Cluster, suffered from slow data loading with DeepDive. To tackle this problem, I implemented a faster loader for DeepDive using APIs provided by MySQL Cluster.
Working on an interactive KBC tool that automates feature engineering. A systematic approach to feature engineering in KBC has been proposed, but DeepDive does not yet automate it well. I was working on a tool named BrainDump that automatically generates a report summarizing each run of DeepDive and detects possible failure modes. I was also working on visualizing the end products of KBC, so that the generated knowledge base can be served online automatically, with various ways for users to interact with it.
Patent Claim Structure Extraction
I led an individual research project at Toshiba R&D, advised by Orihara Ryohei and Okamoto Masayuki, to extract the structure of US patent claims for analysis and comparison.
Patent engineers spend significant time analyzing patent claim structures to understand the range of technology covered or to compare similar patents in the same patent family. Though the claims are the most important section of a patent, they are hard for a human to examine. In this paper, we propose an information-extraction-based technique to grasp the patent claim structure. We confirmed that our approach is promising through empirical evaluation of the entity mention extraction and relation extraction methods. We also built a preliminary interface to visualize patent structures, compare patents, and search for similar patents. This work was published at SIGIR 2017.
Public Scientific Knowledge Base
I was building a prototype of an open knowledge base for the Public Library of Science (PLOS), which integrates scientific entities and relations extracted by DeepDive.
The demo to the left shows our interface, where all scientific entity and relation mentions (genes and phenotypes) are highlighted in the paper, and a summary in the left column lets you easily view the top genes and phenotypes in that paper. You will also be able to provide feedback on whether each extracted mention is correct, using the tagging interface to the right.
Research Projects at Peking University
Ranking and analyzing baseball network
This was my course project for SI 508, Networks: Theory and Application, taught by Prof. Qiaozhu Mei from UMich. The project originated from my idea of treating American Major League Baseball (MLB, of which I am a big fan) games as a network, with players as nodes and their win-loss outcomes in games as links.
To rank the players in the network, we first tried PageRank, but it failed to capture a special attribute of the network: a pitcher who defeats good batters is a great pitcher, and a batter who beats skilled pitchers is an awesome batter. Faced with this obstacle, I used the intuition behind the HITS algorithm (with hubs and authorities) to modify PageRank, and proposed a new random walk algorithm to measure the two abilities. Our next problem was how to evaluate the algorithm when there are no definite criteria for judging baseball rankings. I therefore compared our results to a prestigious ranking system, ESPN Ratings, and the plots show that we achieve results similar to ESPN's with a simpler model and wider applicability.
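The mutual-reinforcement intuition above can be illustrated with a minimal HITS-style sketch. This is not the random walk algorithm from the project, whose details are not given here; it only shows how pitcher and batter scores can reinforce each other. The function name `rank_players`, the input format (dictionaries of win counts), and the toy data are all assumptions made for illustration.

```python
def normalize(scores):
    """Scale scores to sum to 1; fall back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        return {k: 1.0 / len(scores) for k in scores}
    return {k: v / total for k, v in scores.items()}


def rank_players(pitcher_wins, batter_wins, iterations=50):
    """HITS-style mutual reinforcement (illustrative, hypothetical input format).

    pitcher_wins[(pitcher, batter)] = times the pitcher beat the batter
    batter_wins[(batter, pitcher)]  = times the batter beat the pitcher
    """
    pitchers = {p for p, _ in pitcher_wins} | {p for _, p in batter_wins}
    batters = {b for _, b in pitcher_wins} | {b for b, _ in batter_wins}
    pitcher_score = {p: 1.0 / len(pitchers) for p in pitchers}
    batter_score = {b: 1.0 / len(batters) for b in batters}

    for _ in range(iterations):
        # A pitcher earns credit in proportion to the skill of batters defeated.
        new_pitcher = {p: 0.0 for p in pitchers}
        for (p, b), count in pitcher_wins.items():
            new_pitcher[p] += count * batter_score[b]
        pitcher_score = normalize(new_pitcher)

        # A batter earns credit in proportion to the skill of pitchers beaten.
        new_batter = {b: 0.0 for b in batters}
        for (b, p), count in batter_wins.items():
            new_batter[b] += count * pitcher_score[p]
        batter_score = normalize(new_batter)

    return pitcher_score, batter_score


# Toy data (made up): P1 often beats the strong batter B1.
pitcher_wins = {("P1", "B1"): 3, ("P1", "B2"): 2, ("P2", "B2"): 3}
batter_wins = {("B1", "P2"): 3, ("B1", "P1"): 1, ("B2", "P1"): 1}
pitcher_score, batter_score = rank_players(pitcher_wins, batter_wins)
print(sorted(pitcher_score.items(), key=lambda kv: -kv[1]))
print(sorted(batter_score.items(), key=lambda kv: -kv[1]))
```

In this toy example P1 ranks above P2 because its wins come against the stronger batter, which is the property plain PageRank on a single node set does not express directly.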
In the data-mining phase, I studied the network over time and found two interesting patterns: recent players are closer to one another in skill than earlier players were, and good pitchers are better at batting than ordinary pitchers.
Defending against cloning attacks in OSNs
My other independent research project is about defending against cloning attacks in Online Social Networks (OSNs). Cloning attackers disguise fake accounts as existing users by copying their profiles, and then send friend requests to the friends of the cloned victim. This project was motivated by my long-standing interest in making OSNs more robust, and by my chance encounter with a cloning attack while using Renren (a Facebook-style OSN in China). I conducted a literature search and found that although earlier studies described this attack pattern, the pattern as described could not be used for large-scale attacks, and the studies did not provide a method for defending against it. So I first improved the attack pattern with snowball sampling (adding the friends of deceived users) and iterated attacks (cloning the friends of deceived users), to point out its potential threat. Second, we tested its feasibility on Renren. Then I came up with a simple but powerful server-side defense system based on IP sequence matching. I also noticed that this defense strategy is vulnerable to IP spoofing, so in the future I would like to study stronger metrics of account identity, such as click pattern matching and action time similarity.
Detecting Sybil groups in OSNs
My major research project, advised by Professor Yafei Dai in her lab, was to detect Sybil attack groups in OSNs. Sybil attackers manipulate multiple accounts to increase their power. We aimed at detecting Sybil attacks in the wild, in cooperation with Renren, the "Facebook in China" with over 200 million users. I worked with Jing Jiang, a graduate student in our lab. Our work was published in ICDCSW '12 and JCST. In the project, I wrote the programs for all phases from scratch, implemented efficient algorithms to handle a graph with millions of nodes, and designed many measurements based on discussions with Jing.
Assessing the Impact of User-interaction Transparency in Social Networks
I was involved in another project at the lab, on understanding latent user interactions, working with Jing Jiang and advised by Professor Dai. In OSNs, profile browsing, which is latent to third parties, is actually the most prevalent type of user interaction. Using a dataset provided by Renren, we compared this latent network with the visible network of comments and retweets. My work involved measuring structural properties, including conductance, modularity, and mixing time, for both the visible and latent graphs. I enjoyed this research, both for understanding the characteristics of different networks in the wild and for quantifying human interactions. As a future question, I would be very interested in comparing the dynamics of latent and visible networks, especially their information diffusion, to discover how hot topics and rumors spread among users.
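As an illustration of one of the structural properties mentioned above, here is a minimal sketch of conductance using its standard textbook definition: the number of edges crossing a cut divided by the smaller of the two sides' total degree. It is not the measurement code used in the project; the function name `conductance`, the edge-list format, and the toy graph are assumptions, and the graph is assumed to be undirected and unweighted with each edge listed once.

```python
def conductance(edges, subset):
    """Conductance of the cut (subset, rest) in an undirected, unweighted graph.

    edges:  iterable of (u, v) pairs, each undirected edge listed once
    subset: nodes on one side of the cut
    """
    subset = set(subset)
    cut = 0        # edges with exactly one endpoint in the subset
    vol_in = 0     # sum of degrees of nodes inside the subset
    vol_out = 0    # sum of degrees of nodes outside the subset
    for u, v in edges:
        u_in, v_in = u in subset, v in subset
        if u_in != v_in:
            cut += 1
        vol_in += int(u_in) + int(v_in)
        vol_out += int(not u_in) + int(not v_in)
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("one side of the cut has no edges")
    return cut / denom


# Toy graph (made up): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(conductance(edges, {0, 1, 2}))  # one crossing edge over volume 7
```

A low value means the subset is well separated from the rest of the graph, so comparing conductance across the visible and latent graphs gives one view of how tightly communities are knit in each.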
Research Exchange Program at Technion
In Fall 2012 I joined a research exchange program at Technion, advised by Professor Daniel Freedman. At Technion I took advanced graduate courses, including seminars in Reliable Distributed Computing by Prof. Idit Keidar, and Program Analysis and Synthesis by Prof. Eran Yahav. I also took part in academic activities at Technion, attending lectures and colloquia on a wide range of topics in Computer Science given by international researchers. Most importantly, I explored new research directions as part of a vibrant group. Our topics covered Programming Languages, Systems, and Human-Computer Interaction. We designed description languages to automate the creation of both front-end and back-end systems, and studied human behavior patterns in interacting with services.