Chemistry likes to think of itself as the “central science.” Is that true? Intuitively it makes sense. But how can we measure that more rigorously? In comes the Stanford Dissertation Browser:
The Stanford Dissertation Browser is an experimental interface for document collections that enables richer interaction than search. Stanford’s PhD dissertation abstracts from 1993-2008 are presented through the lens of a text model that distills high-level similarity and word usage patterns in the data. You’ll see each Stanford department as a circle, colored by school and sized by the number of PhD students graduating from that department.
When you click a department, it becomes the focus of the browser and every other department moves to show its relative similarity to the centered department. The similarity scores are computed using a supervised mixture model based on Labeled LDA: every dissertation is taken as a weighted mixture of a unigram language model associated with every Stanford department. This lets us infer that, say, dissertation X is 60% computer science, 20% physics, and so on. These scores are averaged within a department to compute department-level statistics (the similarities shown), and need not be symmetric. For instance, Economics dissertations at Stanford use more words from Political Science than vice versa. Essentially, the visualization shows word overlap between departments measured by letting the dissertations in one department borrow words from another department. Which departments borrow the most words from which others? The statistics are computed for each year in the data.
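To make the averaging step in that description concrete, here is a minimal Python sketch. It is not the browser's code and it does not fit a Labeled LDA model. It assumes the per-dissertation mixture weights are already known, and every number, along with the names `dissertations` and `department_similarity`, is made up for illustration. The only point is to show how averaging mixtures within a department produces scores that need not be symmetric.

```python
from collections import defaultdict

# Hypothetical per-dissertation mixture weights (illustrative, not real data).
# Each entry: (home department, {department: share of words attributed to it}).
dissertations = [
    ("Economics", {"Economics": 0.7, "Political Science": 0.3}),
    ("Economics", {"Economics": 0.5, "Political Science": 0.5}),
    ("Political Science", {"Political Science": 0.9, "Economics": 0.1}),
    ("Political Science", {"Political Science": 0.8, "Economics": 0.2}),
]


def department_similarity(docs):
    """Average the mixtures of each department's dissertations.

    Returns {home department: {other department: mean weight}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for home, mixture in docs:
        counts[home] += 1
        for dept, weight in mixture.items():
            totals[home][dept] += weight
    return {
        home: {dept: total / counts[home] for dept, total in row.items()}
        for home, row in totals.items()
    }


similarity = department_similarity(dissertations)

# Compare the two directions of borrowing.
econ_from_polisci = similarity["Economics"]["Political Science"]
polisci_from_econ = similarity["Political Science"]["Economics"]
print(f"Economics borrows from Political Science: {econ_from_polisci:.2f}")
print(f"Political Science borrows from Economics: {polisci_from_econ:.2f}")
```

With these invented weights, the Economics row gives Political Science an average of 0.40, while the Political Science row gives Economics only 0.15, which is the kind of asymmetry the description mentions.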
You can play around with the browser here. I'm assuming this sort of analysis is going to get much, much easier in the near future, given the sea of data from which powerful software can extract and visualize patterns. Below the fold are five screenshots I thought were of interest: Genetics, Biology, and Chemistry dissertations in 2008, and Anthropology in 2007 and 1998.