This work is the result of a recent request by a client to measure the number of Java developers in India. Although other methods can be used to evaluate this number, short of a national survey all of them are only estimates with inherent errors which are difficult to measure. It is therefore important to compare several methods and evaluate the possible errors.
I address this problem over separate posts. One method involves comparing the number of people using Google to search for information related to Java programming in India with the number doing so in the world as a whole. There are several sources of information that can be used to estimate the total number of Java programmers in the world, and combining these with the search comparison lets us work towards an answer to our client’s question.
Comparing regional with world trends
Google Trends is a fantastic tool that really demonstrates the power of the Internet in studying the human species. It delivers the relative number of searches made by Internet users on the Google search engine across the world. One can determine the relative number of searches made from 2004 until today and compare various search terms. This has been used to compare the relative number of programmers of different programming languages.
Google Trends also allows the comparison of different regional searches for a given search term. However, this is where the difficulty in data comparison starts. Google Trends only allows relative comparisons, for it normalises the data by the maximum number of searches made over the time period. Furthermore, for regional trends, it further normalises the data by the total number of searches made from that region. In other words, searching for the term ‘Java tutorial’ might show up as 25% of all searches from the USA and 40% from a country like Bangladesh. However, this does not mean more people are searching for ‘Java tutorial’ in the latter, for 25% of searches from the USA represents a far larger number, since the population connected to the Internet and using Google as its search engine is far larger there than in Bangladesh.
Hence it is only possible to make limited comparisons between the two data sets.
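To make the point concrete, here is a minimal Python sketch. The 25% and 40% shares are the illustrative figures used above; the total search volumes per region are purely hypothetical numbers picked for illustration, since Google Trends does not publish them.

```python
# Relative shares taken from the example in the text.
share_of_searches = {"USA": 0.25, "Bangladesh": 0.40}

# HYPOTHETICAL total search volumes per region (not real data).
total_searches = {"USA": 1_000_000, "Bangladesh": 50_000}

# Absolute number of searches = relative share x total volume of the region.
absolute = {
    region: share_of_searches[region] * total_searches[region]
    for region in share_of_searches
}

for region, count in absolute.items():
    print(f"{region}: {count:,.0f} searches for 'Java tutorial'")
```

With these assumed volumes the region with the smaller relative share still produces the larger absolute number of searches, which is exactly the information the regional normalisation hides.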
There is, however, a solution to this dilemma. To keep this post short I have to use some mathematical notation, so please bear with me. Let us define the following,
– $W_J(t)$: World Java Google data ensemble for a given time period t, that is, the relative Google Trends data for searches made worldwide for ‘Java’. (Note, it could be any other search term.)
– $I_J(t)$: India Java Google data ensemble for a given time period t, that is, the relative Google Trends data for searches made from India for ‘Java’.
Google Trends data can be downloaded as a CSV file. $W_J(t)$ is given as the actual number of hits on the search engine for a given search criterion, $w_J(t)$, divided by the maximum number of hits for that period t, $\hat{w}$, in other words,
$$W_J(t) = \frac{w_J(t)}{\hat{w}}$$
or, for a shorter notation,
$$W_J = \frac{w_J}{\hat{w}}$$
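As an illustration of this normalisation, the following Python sketch divides a series of hit counts by its maximum. The function name `normalise` and the raw hit counts are my own hypothetical choices; the real counts are never exposed by Google Trends.

```python
def normalise(hits):
    """Divide every value by the maximum of the series (relative data)."""
    peak = max(hits)
    if peak == 0:
        raise ValueError("cannot normalise a series with no hits")
    return [h / peak for h in hits]


# HYPOTHETICAL raw hit counts over three time steps (not real data).
world_hits = [200, 400, 800]
india_hits = [20, 40, 80]

world_relative = normalise(world_hits)
india_relative = normalise(india_hits)

print(world_relative)
print(india_relative)
```

Both series have the same shape but differ in volume by a factor of ten, and after normalisation they are indistinguishable. This is why two separately downloaded data sets cannot be compared directly.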
The problem arises when we want to compare one set of data with another.
Say, for example, we wish to compare searches for “Java tutorial” in 2012 across the world and coming from India. Google Trends does not allow these two sets of data to be superposed in the same query. Hence the denominators in the two data sets are different, and only relative comparisons within each set are possible. However, within a given trend analysis, one can compare different search terms, for the entire data set is normalised using the same denominator.
Bridging trend data sets
To overcome this, here is a first step towards solving the problem. In the example, let’s say we want to compare the Indian searches for the phrase “Java Tutorial” to those of the world as a whole. First of all we ensure that the time periods being searched are the same in both data sets, hence I will drop the time factor t from the equations. Writing $w_J$ and $i_J$ for the actual number of hits worldwide and from India, we are therefore looking for,
$$\frac{i_J}{w_J} \qquad (1)$$
We know that,
$$W_J = \frac{w_J}{\hat{w}}, \qquad I_J = \frac{i_J}{\hat{i}}$$
where $\hat{w}$ and $\hat{i}$ are the normalising maxima of the world and India data sets respectively. So we can conclude that,
$$\frac{i_J}{w_J} = \frac{I_J}{W_J} \cdot \frac{\hat{i}}{\hat{w}} \qquad (2)$$
where the ratio $\hat{i}/\hat{w}$ is unknown, which is the crux of the problem. Google Trends does not provide any of the normalising denominators it uses, hence making any significant quantitative comparison moot. However, there may be ways to circumvent this problem by introducing a normalising factor with which the two sets of data could be bridged. One way to do this is to build ensembles of data comprised of the data set in which we are interested and a second one which acts as a constant between the two ensembles. I define the first ensemble as the world data set,
$$W = \{W_J, W_d\}$$
where $W_d$ represents search results for a term d that is expected to return near constant values when queried in the worldwide W context as well as the regional context, I in this case. We therefore define it as,
$$W_d = \frac{w_d}{\hat{w}}$$
note that the term $\hat{w}$ remains constant within the ensemble $W$. We now define the equivalent regional data ensemble for India as,
$$I = \{I_J, I_d\}$$
where $I_d$ is the search term introduced above within the regional context, and is evaluated in the Google Trends data as,
$$I_d = \frac{i_d}{\hat{i}}$$
Now, this is where the magic happens! We defined the search term ‘d’ such that,
$$w_d = i_d$$
I will come back to this assumption at the end of this evaluation, but for now it gives us a handle to compare the two ensembles, $W$ and $I$. We can drop the time t notation as we are looking at the same time period in both ensembles, such that,
$$W_d \, \hat{w} = I_d \, \hat{i} \quad \Rightarrow \quad \frac{\hat{i}}{\hat{w}} = \frac{W_d}{I_d}$$
We can solve equation (2) above using this result, for recall that
$$\frac{i_J}{w_J} = \frac{I_J}{W_J} \cdot \frac{\hat{i}}{\hat{w}} = \frac{I_J}{W_J} \cdot \frac{W_d}{I_d}$$
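The bridging idea can be checked numerically with a small Python sketch. It assumes, as in the text, that the bridging term has the same actual number of hits in both ensembles and that each ensemble is normalised by its own single maximum. The function name `bridged_ratio` and all raw hit counts are hypothetical and serve only to verify the algebra.

```python
def bridged_ratio(I_J, W_J, I_d, W_d):
    """Estimate the India-to-world ratio of actual hits for the term of interest.

    I_J, I_d: relative values read from the India ensemble.
    W_J, W_d: relative values read from the world ensemble.
    Assumes the bridging term d has the same actual hits in both ensembles.
    """
    if W_J == 0 or I_d == 0:
        raise ValueError("W_J and I_d must be non-zero")
    return (I_J / W_J) * (W_d / I_d)


# HYPOTHETICAL raw hit counts (not real data).
w_J, i_J = 5000, 1000   # hits for the term of interest: world, India
w_d = i_d = 2000        # bridging term: identical hits in both (the assumption)

# Normalising maxima of each ensemble; hidden from us in practice.
w_hat = max(w_J, w_d)
i_hat = max(i_J, i_d)

# What a Google Trends style download would expose: relative values only.
W_J, W_d = w_J / w_hat, w_d / w_hat
I_J, I_d = i_J / i_hat, i_d / i_hat

estimate = bridged_ratio(I_J, W_J, I_d, W_d)
print(estimate, i_J / w_J)
```

The estimate is computed only from the relative values, yet it matches the ratio of the hidden raw counts, because the bridging term cancels the two unknown denominators.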
We are able to build comparative studies between two Google Trends data sets by introducing a bridging search term which is, as much as possible, constant in both queries. How likely are we to find such a query result set? There are cultural patterns and trends which make certain searches much more likely to occur within certain regions. We show an example below based on a search which Google Trends reveals to be coming predominantly from its India-based visitors, thus enabling us to build this bridging method.
I made an important assumption: the bridging search term d(t) is constant across the two ensembles. How likely is this? There are several search terms which come almost only from India. For example, there is a popular online retail site in India called Myntra.com. Google Trends reveals that searches made for ‘myntra’ are coming predominantly from within India.
Looking at the contribution from the rest of the world, we see that India is strongly represented, while smaller regions are also contributing.
However, if we drill down into these other sources contributing to the search, we realise that they are very small indeed. For example, the Bolivian connection is coming from a single town called Sucre,
a strange fact indeed, and it would be interesting to find out why the town of Sucre in the Bolivian Andes has such a demand for this site, possibly some expatriate Indians (?). If we look at the other regions in the world, such as Qatar, we see the searches are coming from a single location, Doha, probably from the Indian diaspora in that city.
Internet world tracking data puts Internet penetration in India at a little over 15% of the population, which means close to 200 million users (although probably higher with mobile phones). The same survey estimated that 97% of this population uses Google as its search engine. Bolivia has an Internet population of a little over 4 million, while Qatar has close to 2 million. Again, looking at neighbouring countries’ adoption of Google as a search engine, I think we can safely assume that over 90% of both these populations use Google. Now, if we compare (going back to Google Trends) the relative search for ‘myntra’ and ‘google’ in these populations, we find that it is less than 2% (in fact too small to be registered by Google Trends in Qatar, and barely 2% coming from Sucre in Bolivia). On the other hand, within India, Google registers a 5% relative search.
So in conclusion, we have close to 10 million searches for ‘myntra’ coming from India, while the other sources contributing to worldwide search trends are Bolivia (close to 80,000) and Qatar (less than 20,000). Hence, no more than 100,000 of the worldwide searches for the term ‘myntra’ come from outside India. That’s roughly 1% of the total, too small an error to worry about.
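The arithmetic behind this estimate can be laid out in a few lines of Python. The sketch uses only the figures quoted above, and it follows the text in reading the 5% relative search as a fraction of Google users, which is an assumption of this estimate rather than something Google Trends states. The Bolivia and Qatar counts are taken directly as the rounded upper figures given in the text.

```python
# Figures quoted in the text.
india_internet_users = 200_000_000   # "close to 200 million users"
google_share_india = 0.97            # share of users searching with Google
myntra_relative_india = 0.05         # relative search for 'myntra' in India

# Estimated number of 'myntra' searches from India.
india_searches = india_internet_users * google_share_india * myntra_relative_india

# Outside-India contributions, as the rounded figures given in the text.
bolivia_searches = 80_000
qatar_searches = 20_000
outside_searches = bolivia_searches + qatar_searches

# Share of all 'myntra' searches that come from outside India.
outside_share = outside_searches / (india_searches + outside_searches)

print(f"India:         {india_searches:,.0f}")
print(f"Outside India: {outside_searches:,}")
print(f"Outside share: {outside_share:.2%}")
```

The outside share is the error introduced by treating ‘myntra’ as an India-only search term when it is used as the bridging term.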
In the next post of this series, I will be exploring how to derive the total number of Java developers in India using the above tool and comparing this method with other methods available on the Internet.