Google Ngram tool is a boon for academics

December 17, 2010, 6:49 PM UTC

Google is putting all of those scanned books to good use.

Google (GOOG) has deep roots in academia: its founders came out of Stanford’s computer science PhD program, and untold numbers of PhDs fill Google’s ranks. So it isn’t a surprise to see some of Google’s products finding secondary uses as academic tools.

Google is scanning the world’s books and cataloguing the texts in massive searchable databases, both to advertise against and to distribute through its Google Bookstore and Google Books projects. But some researchers have found a way to mine those catalogued books for trends.

As Google puts it in its announcement:

“Since 2004, Google has digitized more than 15 million books worldwide. The datasets we’re making available today to further humanities research are based on a subset of that corpus, weighing in at 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. The datasets contain phrases of up to five words with counts of how often they occurred in each year.

“These datasets were the basis of a research project led by Harvard University’s Jean-Baptiste Michel and Erez Lieberman Aiden published today in Science and coauthored by several Googlers. Their work provides several examples of how quantitative methods can provide insights into topics as diverse as the spread of innovations, the effects of youth and profession on fame, and trends in censorship.”
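
For readers who want a feel for what "phrases of up to five words with counts of how often they occurred in each year" means in practice, here is a minimal Python sketch. It is not Google's pipeline: the function names, the (year, text) input format, the whitespace tokenizer, and the way frequencies are normalized are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, max_n=5):
    """Yield every phrase of 1..max_n consecutive tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_by_year(books, max_n=5):
    """Map year -> Counter of phrase -> occurrences.

    `books` is a hypothetical iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive tokenizer, an assumption
        counts[year].update(ngrams(tokens, max_n))
    return counts


def relative_frequency(counts, phrase):
    """Per year, the share of same-length phrases that match `phrase`."""
    n = len(phrase.split())
    series = {}
    for year, counter in sorted(counts.items()):
        total = sum(c for p, c in counter.items() if len(p.split()) == n)
        series[year] = counter[phrase] / total if total else 0.0
    return series


# Toy corpus, invented purely for illustration.
books = [
    (1890, "the rights of women"),
    (1900, "feminism and the rights of women"),
    (1900, "feminism in france"),
]
counts = count_by_year(books)
series = relative_frequency(counts, "feminism")
print(series)
```

Dividing by the number of same-length phrases in each year is one simple way to keep years with more scanned books from dominating the trend line; other normalizations are possible.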

Enter the Google Labs Ngram Viewer, which lets you type in words or phrases and track their popularity over time in a variety of languages.

For instance, take a look at when and how “feminism” entered the French vernacular at the turn of the last century:



Oh, and this is Google, so there is an Easter egg. Ngram “Never gonna give you up.”