Understanding What You're Searching For In A Multilingual World

This article is more than 8 years old.

At the core of all search engines like Google is the need to properly interpret highly ambiguous questions that may be just a few words long, anticipate the user’s intent in asking the question, and determine the most relevant information to return based on how the user is most likely to use it. The increasingly multilingual nature of the modern web and the rich linguistic variety that comes with it make this an ever more complicated process. In particular, attempts to use search data to understand macro-level societal patterns quickly run into challenges stemming from the fact that searches for the same topic may involve very different words across the world's languages. Here I explore how Google Trends attempts to address the multilingual challenge, the limits of its approach, and where it can yield contradictory results.

Let's begin by imagining a one-word Google search for “pizza.” To return the most relevant results, Google must determine from this single word whether the user would likely want a list of nearby pizza restaurants (perhaps someone new to the area), a list of today’s lunch specials (someone who already knows the nearby restaurants and is deciding where to go for lunch), a list of pizza recipes (someone wanting to cook it themselves), a history of pizza and its global impact (a student writing a school paper), or perhaps the latest pizza trends (a chef experimenting with new ideas). Each of these implies a very different definition of relevance in terms of what kinds of pages should be returned to the user.

You can see a glimpse of this complexity by glancing at the Google Trends entry for “pizza” and scrolling to the bottom of the page to see the list of related searches. Everything from “pizza near me” to coupons to recipes can be found in the list. The search volume timeline shows that the web-searching public has grown steadily more interested in this menu item over the last decade at an almost perfectly linear rate. Looking at the map below, it would appear that the United States, Canada, and Australia/New Zealand are the leading countries searching for pizza, with its Italian birthplace ranking relatively low.

The reason is simple: “pizza” is an English word, so the resulting map reflects only English-language web users. To accurately understand the global geography of search interest for pizza, one must search for its name across all the world’s languages. To assist with understanding topics in a multilingual world, Google Trends offers “Topics,” predefined thematic headings that group all related words, alternative spellings, and names in other languages under a single label. Google gives the example of the Topic “Tokyo - Capital of Japan” incorporating terms like 東京, Токио, Tokyyo, Tokkyo and related phrases like “Japan Capital.” Examining search interest in the Topic “Pizza” instead of the exact English keyword “pizza” yields an identical interest timeline but a very different geographic map, this time with interest centered in Italy and Europe rather than the United States (though the United States is still strongly represented).

Topics can therefore be extremely powerful, grouping together translations into many languages under a single heading. On the other hand, linguistic overlap, in which a word exists in multiple languages with very different meanings in each language, can complicate the ability to use Topics for multilingual searches. To demonstrate this, the timeline below shows searches within the United States for the English phrase “united nations.” Immediately clear is the steadily declining interest in the organization over the last 10 years, which is also seen in worldwide searches for the phrase.

An Arabic or Japanese speaker, however, would likely not use the English phrase “united nations” and thus Google helpfully has the Topic “United Nations” that groups together its common spellings and names in other languages. Searches in the United States for the Topic “United Nations” show relatively stable search interest, driven primarily by the Topic’s inclusion of the acronym “un” that is commonly used as shorthand to refer to the agency. However, looking at worldwide searches for the Topic in the timeline below, it is almost a mirror image of the one above, exhibiting a linear increase globally in interest in the United Nations.

What could be driving this? The primary factor appears to be the inclusion of searches incorporating the word “un” as a proxy for the United Nations. Looking at the map of which countries are searching the most for the Topic United Nations, Latvia dominates the list, while the remaining countries are almost exclusively French- and Spanish-speaking nations. All three languages use “un” as an everyday function word: it is a common indefinite article in French and Spanish and the word for “and” in Latvian, meaning it features extensively in ordinary searches the same way the word “the” does in English-language searches. In fact, simply searching Google Trends for the exact word “un” yields a timeline fairly similar to that for the United Nations Topic. Looking more closely, even American searches for “un” tend to emphasize Spanish-language searches like “Darte un Beso,” a famous 2013 song, “como hacer un” (“how to” guides), and searches for “Kim Jong Un” alongside searches for the United Nations. This suggests not only that the United Nations Topic may offer an inaccurate timeline due to its inclusion of “un,” but also that limiting search data to a particular country is insufficient to overcome such linguistic differences. Search disambiguation must operate at the linguistic, rather than geographic, resolution.

The underlying problem is that Topics are simply static predefined groupings of words determined by machine learning and/or human editors to be relevant to a given theme at a global scale. In essence, a Topic is a massive Boolean OR statement. It does not perform true semantic disambiguation, which would weigh how a given word is being used in a query, the language of the query, previous queries, and so on. Relevance appears to be determined largely by a term's most common global usage rather than its contextual usage, and thus the word “un” seems to be treated as a reference to the United Nations whether you are searching in English or Latvian. What makes this problematic is that users are not provided an easy way to review the complete list of terms that make up a given Topic, to see their relative linguistic and term affiliations, and to edit the list to remove problematic terms based on domain knowledge.
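To make the “Boolean OR” point concrete, here is a minimal Python sketch contrasting a static term list with a version that also checks the language of the query. Everything in it is an assumption made for illustration: the term list, the rule that “un” counts only in English queries, and the function names are hypothetical, and the query's language is supplied by the caller rather than detected. It is not how Google Trends actually defines or evaluates Topics.

```python
from typing import Dict, Optional, Set

# Hypothetical term list for illustration only; NOT Google's actual Topic definition.
TOPIC_TERMS: Set[str] = {"united nations", "un", "nations unies", "naciones unidas"}

# Assumed rule: "un" is accepted as a United Nations reference only in English queries.
TERM_LANGUAGES: Dict[str, Set[str]] = {"un": {"en"}}


def contains_term(query: str, term: str) -> bool:
    """Single words must match a whole token; phrases match as substrings."""
    q = query.lower()
    if " " in term:
        return term in q
    return term in q.split()


def matches_boolean_or(query: str, terms: Set[str] = TOPIC_TERMS) -> bool:
    """Static grouping: the query counts if it contains ANY term in the list."""
    return any(contains_term(query, term) for term in terms)


def matches_language_aware(
    query: str,
    query_lang: str,
    terms: Set[str] = TOPIC_TERMS,
    term_languages: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """Same OR, but a term restricted to certain languages is ignored elsewhere."""
    restrictions = TERM_LANGUAGES if term_languages is None else term_languages
    for term in terms:
        if not contains_term(query, term):
            continue
        allowed = restrictions.get(term)
        if allowed is None or query_lang in allowed:
            return True
    return False


if __name__ == "__main__":
    samples = [
        ("un security council", "en"),
        ("como hacer un pastel", "es"),
        ("naciones unidas", "es"),
        ("kim jong un", "en"),
    ]
    for query, lang in samples:
        print(query, matches_boolean_or(query), matches_language_aware(query, lang))
```

In this toy setup the Spanish query “como hacer un pastel” is counted by the plain OR but rejected once language is considered. The English query “kim jong un” still slips through both versions, which illustrates why language filtering alone falls short of the contextual disambiguation described above.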

Yet, it is not just linguistic issues that can confound the use of Topics. Examining the Trends Topic “United States presidential election, 2016” and narrowing the results to users in the United States shows interest growing rapidly just months after President Obama’s 2012 reelection. Yet, it also shows that interest in the 2016 election was higher in October 2004 than it is today.

Looking at the “Related Searches” table it is clear that many of the terms like “election,” “presidential election,” and “election polls” are generic terms not specific to the 2016 election, likely accounting for the 2004 spike. However, the second most related term is “2016,” a generic search for anything to do with the year 2016 including the 2016 Olympics, various 2016 model year automobiles, and other major 2016 events, with the presidential election accounting for only a fraction of those searches. Exploring further, searches relating to 2016 seem to exhibit a very similar upwards trend, suggesting that 2016-related searches are driving the upwards curve seen in the Google Trends Topic. Searching for just the word “election” instead yields a relatively stable search profile that shows no appreciable growth in 2015 compared with past years and where current search levels are no higher than in the lead up to any of the past elections.

That leaves the critical question: which of these graphs is correct? Did search interest in the 2016 election really peak over a decade ago and ramp up again starting right after Obama was reelected? Or are searches on the election relatively stable and no higher than before any previous election of the past decade? Without further information, such as an exhaustive list of all of the terms included under the 2016 election Topic and their relative contributions, it is simply impossible to know which is the “correct” timeline. Indeed, in the “big data” era, analyses are frequently based on predefined aggregations and filtering operations that are largely opaque, offering little visibility into the decisions they make.

We see here two key themes at work: the impact of a multilingual web on search disambiguation and the opaqueness of the data filtering that often has substantial influence on the results we receive from our analyses. Google’s use of predefined Topics to group together translations and alternative spellings and names for a theme is a powerful first step towards moving beyond linguistic boundaries when examining macro-scale search behavior. On the other hand, two problems can confound the findings of such groupings: ambiguous words that have different meanings in different languages, and questionable inclusions, like the terms leading to a high correlation with “2016” in the case of the presidential election. Greater transparency is therefore needed before approaches like that used by Topics can enter mainstream use. An interface that displays all terms incorporated under a Topic, their relative contributions to the final results, and the linguistic, geographic, and terminological context of each, and that lets users edit the list to add and remove terms, would go a long way towards addressing these issues.
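As a rough sketch of what such transparency could look like, the Python below computes each term's share of a Topic's total and lets the analyst drop a term and recompute. The counts are invented toy numbers chosen only to make the arithmetic visible, and the function names are hypothetical. Google Trends does not expose per-term counts in this form.

```python
from typing import Dict, Iterable


def term_contributions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each term's fraction of the combined total (0.0 if the total is zero)."""
    total = sum(counts.values())
    if total == 0:
        return {term: 0.0 for term in counts}
    return {term: count / total for term, count in counts.items()}


def without_terms(counts: Dict[str, int], excluded: Iterable[str]) -> Dict[str, int]:
    """Return a copy of the counts with the excluded terms removed."""
    dropped = set(excluded)
    return {term: count for term, count in counts.items() if term not in dropped}


# Invented toy counts, NOT real search data.
toy_counts = {"united nations": 20, "un": 70, "naciones unidas": 10}

shares = term_contributions(toy_counts)
shares_without_un = term_contributions(without_terms(toy_counts, ["un"]))

if __name__ == "__main__":
    for term, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{term}: {share:.0%}")
```

With these made-up numbers a single ambiguous term accounts for most of the Topic's volume, and removing it changes the picture entirely. That is the kind of check an analyst currently has no way to perform.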

As the web expands from its origins as a small English-language informational exchange for academic researchers into a global information fabric spanning the world’s languages, these kinds of complexities are simply inherent growing pains as the world of information research rushes to catch up with our increasingly globalized world and we head towards a “post lingual” world in which language is no longer a constraining barrier on our understanding of the world around us.