Time: 11 AM (Indian Std Time)
Date: 15 January 2016
WeBind.in is organising a live screening of my lecture on ‘What is BI – a case study’.
Time: 11 AM (Indian Std Time)
Date: 15 January 2016
WeBind.in is organising a live screening of my lecture on ‘What is BI – a case study’.
This work is the result of a recent request by a client to be able to measure the number of Java developers in India. Although there are other methods that can be used to evaluate this number, short of a naional survey, all methods are only estimates with inherent errors which are difficult to measure. It is therefore important to compare several methods and evaluate the possible errors.
I resolve this problem in separate posts. However, one method involves comparing the total number of people using google to search for information related to Java programming in India compared to the rest of the world. There are several sources of information that can be used to estimate the total number of Java programmers in the world and therefore evaluate the answer to our client’s question.
Google trends is a fantastical tool that really demonstrate the power of the Internet in studying the Human species. It delivers the relative number of searches made by Internet users on the Google search engine in the world. One can determine relative the number of searches made since 2004 till today and compare various search terms. This has been used to compare the relative number of programmers of different programming languages.
Google trends also allows the comparison of different regional searches for a given search term. However, this is where the difficulty in data comparison starts. Google trends only allows relative comparisons, for it normalises the data as per the maximum of total number of searches made over the time period. Furthermore, for regional trends, it further normalises it to the total number of searches made by that region. In other words, searching for the term ‘Java tutorial’ might show up as 25% of all searches from the USA and 40% from a country like Bangladesh. However, it does not mean more people are searching for ‘Java tutorial’ in the later, for 25% of searches from the USA represents a far larger number since the population connected to the Internet and using Google as its search engine is far larger that those in Bangladesh.
Hence it is only possible to make limited comparisons between the 2 data sets.
There is however a solution to this dilemma. In order to shorten the content of this post I have to use some mathematical notation, please bare with me. Let’s us define the following,
– World Java Google data ensemble for a given time period t, or the relative google trends data for the search made world-wide for ‘Java’. (Note, it could be any other search)
– India Java Google data ensemble for a given time period t, or the relative google trends data for the search made from India for ‘Java’
Google trends data can be downloaded as a CSV file. is given as the actual hits on the search engine for a given search criteria divided by the maximum number of hits for that period t , , in other words,
for a shorter notation
The problem arises when we want to compare one set of data with another.
Say for example we wish to compare the search for “Java tutorial” in 2012 across the world and coming from India. Google Trends does not allow the superposition of these two sets of data on the same search. Hence the denominators in each data sets are different, allowing only comparative studies. However, within a given trend analysis, one can compare different search terms, for the entire data set is normalised using the same denominator.
Too overcome this here is a solution to start solving this problem. In the example, let’s say we want to compare the % of Indian searches for the phrase “Java Tutorial” to those of the rest of the world. First of all we ensure that the time periods being searched are the same in both data sets, hence I will drop the time factor t from the equation, we are therefore looking for,
We know that,
So we can conclude that,
Where, which is the crux of the problem. Google trends does not provide any of the normalising denominators it uses, hence making any significant quantitative comparison mute. However, there may be ways to circumvent this problem by introducing a normalising factor with which the 2 sets of data could be bridged. One way to do this is to build ensembles of data comprised of the data set in which we are interested and a second one which acts as a constant between the two ensemble data groups. I define the first ensemble as the the world data set,
where represents search results for a term that is expected to return near constant values when queried in the worldwide W context as well as the regional context, I in this case. We therefore define it as,
note that the term remains constant within the ensemble . We now define the equivalent regional data ensemble for India as,
where is the search term introduced above within the regional context, and is evaluated in the Google Trends data as,
Now, this is where the magic happens! We defined the search term ‘d’ such that,
I will come back to this assumption as this end of this evaluation, but for now this assumption gives us a handle to compare the 2 ensembles, and . We can drop the time t notation as we are looking at the same time period in both ensemble, such that,
We can solve for equation (2) above using this result, for recall that
We are able to build comparative studies between 2 Google Trends datasets by introducing a bridging search term which is as much as possible constant in both queries. How likely are we to find such a query result set? There are cultural patterns and trends which make certain searches much likely to occur within certain regions. We show an example below based on Indian search patterns which Google Trends reveals to be predominantly coming from its India-based visitors, this enabling us to build this bridging method.
I made an important assumption, the bridging search term d(t) is constant across the two ensembles. How likely is this ? There are several search terms which come only from India. For example, there is a popular online retail site in India called Myntra.com. Google trends reveals that searches made for ‘myntra’ is coming predominantly from within India.
Looking at the contribution from the rest of the world, we see that India is strongly represented, while smaller regions are also contributing
However, if we drill down into these other sources contributing to the search we realise that they are very small indeed, for example, the Bolivian connection is coming from a single town called Sucre,
a strange fact indeed, and it would be interesting to find out why the town of Sucre in the Bolivian Andean mountains have such a demand for this site, possibly some expatriate Indian (?). If we look at the other regions in the world, such as Qatar, we see they are coming from a single location, Doha, probably form the Indian diaspora in that city,
Internet world tracking data put penetration of Internet in India at a little over 15% of the population connected, that means close to 200 millions users (although probably higher with mobile phones). This same survey estimated 97% of this population using Google as it search engine. Bolivia has an Internet population of a little over 4 million, while Qatar has close to 2 million. Again, looking at neighbouring countries adoption of Google as a search engine I think we can safely assume that both these population uses Google at over 90%. Now, if we compare (going back to Google Trends) the relative search for ‘myntra’ and ‘google’ in these populations, we realise that less than 2 % (in fact too small to be noted by Google Trends inQatar, and barely 2% coming from Sucre in Bolivia ). On the other hand, within India, Google registers a 5% relative search.
So in conclusion, we have close to 10 million searches for ‘myntra’ coming from India while the other sources contributing to worldwide search trends are from Bolivia (close to 80,000) and Qatar (less than 20,000). Hence, no more than 100,000 of worldwide searches for the term ‘myntra’ comes from outside India. That’s less than 1% of the total, too small an error to worry about.
In the next post of this series, I will be exploring how to derive the total number of Java developers in India using the above tool and comparing this method with other methods available on the Internet.
The reasoning is quite simple: for example, water is an important part of anyone’s life. So in the case of Magellan (pictured above) what are the chances for his water molecules (that he used in his life-time) to be found in a glass of water we drink, or a drop of rain from the sky?
We need to solve 2 questions,
(Q1) How many molecules circulate in anyone’s body in their entire life-time?
– We know that our body regenerates entirely in 7 years.
– 60% of our body is made of water.
– We assume that our avg body weight is 50 kg in a life-time (linear increase from 0 to 20, steady weight of 80kg till 70)
– average life-time of 70 years.
H2O (kg/life-time) = (60% x 50kg) x(70 yrs / 7yr to regenerate completely)
300 kg of water in 1 life-time. 18gm of water represents 1 mol of water molecules (6 x 1023)
300 kg = 1028 molecules of water will circulate in the body in the entire life-time of an average person.
(Q2) How many water drops are there on earth?
– A recent survey by the US Ocean Survey concluded that there is about 1.3 x1021 litres of water on Earth (oceans, lakes,rivers).
– 1 drop of water is about 10mm3, or 10-5 litre (assuming the average density of water of 1gm/cm3).
So we can calculate 1.3×1026 drops of water on Earth.
Then IF we assume that all 1028 molecules of Magellan’s life-time are spread evenly in the entire water reserve of Earth, we can calculate how many of these molecules are present in each drop of water on earth:
1028 / 1.3 x1026 = about 80 molecules in each drop of water.
Note: This is a quick, back-of-the-envelope calculation and therefore full of assumptions. There are 2 main ones which would affect considerably the above number. The first is that we assume that the 1028 in a life-time are unique molecules, however this is simply wrong, because the chances are that someone will ingest their own water molecules (localised ocean, rain, river cycle), so this number would in reality be lower.
The 2nd main assumption is that these molecules are evenly spread throughout the world, and again this is not quite right. Possibly in time our climate and ocean currents would carry water molecules across the world, but there would always be pockets of higher concentration. However, this is somewhat offset by the fact that in this specific example, Magellan went round the world and spent a major part of his life in India and other Portuguese colonies around the world, so his molecules of water where naturally disseminated in various places around the world.
The general understanding of Business Intelligence is the process by which data collected by an enterprise (sales, purchases, production, etc) is compiled into summary reports for the management and executive leaders so as to enable to understand the overall status of their business. This data is usually stored in databases, although smaller enterprises also stores this in excel files. This data are records of core activity performed by the business. For example, a manufacturing business keeps records of raw material consumption, stock of products, payments, costs, profits and client history among others. A retail business on the other hand will keep records of sales , item stock, orders, profits, expenses and so on. Making sense of this data is one aspect of Business Intelligence. Another aspect is making sense of market data that is outside the direct sphere of influence of the enterprise. This is more difficult to obtain as it requires extensive market surveys to compile. In summary, Business Intelligence is the process by which data is transformed into information.
In the majority of enterprises, the extent to which business data is analysed and transformed into valuable information for a company’s executive management is often limited to tabular reports that answers two basic questions: How many? and How much? How many phones have we manufactured in a given period? Or, how much turnover was made from the sales of an item?
The basic tabular reports described above are the very tip of the BI iceberg. BI analysis can be categorised into 5 groups of questions which it attempts to answer:
This article sets out to give examples of BI analysis relating to the first 3 questions above. We will briefly review the last 2 questions, but these are beyond the scope of this article and are listed here for the information of the reader.
Next article: How Many? – BI General Question (Part 2)
Previous article: Introduction – What is the Business Intelligence, a case study.
Next article: How Much? – BI General Question (Part 3)
This is the second part of a general introduction to BI using a case study of a mobile phone store.
How Many? – this relates to the quantity of items handled by the enterprise and and measures are done over fixed periods of time (e.g. monthly, yearly or even since the start of the enterprise).
A lot of enterprise report this data as tabular or bar chart forms, but proper analysis can often reveal much more than what the data seems to represent. For example, here is the number of phones sold by a high street shop on a monthly basis:
Figure 1 shows a conventional bar chart plot of table 1. Neither the table nor the bar chart is of much use because none of the representation give us any added visual information. The only information one can clearly extract from figure 1 as opposed to table 1, is that certain phone categories outsell others.
Figure 2 is the same same data as figure 1 but the visual information is much clearer and straight away we see a trend appearing in the data: April and October are high sale months across all phone categories. There also seem to be a decline in some sales, but this is more difficult to spot from this graph. To reveal this trend, we need to apply a differential analysis. This is shown in figure 3 below.
The 2nd trend now appears much more prominently in figure 1.3, while all phones show prominent sales in the months of April and October, only Android phones show an increase in sales, all the other categories are in decline.
This result would be the start of an analysis exercise to answer the questions that naturally arise from this conclusion. Why are phone sales in decline, is it a general market trend or specific to the shop? Why are Android phones sales on the increase? Is it a market trend? Is it at the expense of other types of phones? These are some the questions the executive management would want to find answers to. Market data and past years performance of the shop would be required to fully answer these questions.
Previous article: How Many? – BI General Question (Part 2)
Next article: How Often? – BI General Question (Part 4)
This is the third part of a general introduction to BI using a case study of a mobile phone store.
How Much? – this relates to the money handled by the enterprise and measures are done over fixed periods of time (e.g. monthly, yearly or even since the start of the enterprise).
Again many enterprises report this data as tabular or bar chart forms, but proper analysis can often reveal much more than what the data seems to represent. Back to our high street phone shop example, here are the sales figure for phones sold on a monthly basis:
|Avg Unit Cost||Sales
Again, tabular form report does not tell the story, one needs to use visual representations spot the trends.
Figure 4 shows the line chart of the sales for each phone category in a month by month basis. The trend that is again apparent that we saw in figure 2 is the two high months in April and October. We can also conclude that Android and Smart phones are the biggest contributors to the turnover of the shop. However, the data is hiding more information. Let’s keep probing.
In figure 5, the total sales (turnover) is plotted as a line chart on the left axis, while the total number of phones is plotted on the right axis. The line chart reflects what we already expect, namely that the turnover is correlated to the number of phones sold. However, there is one rather curious artifact, the blue line (total sales) is below the number of phones (green line) in the first quarter, but rises and stays above it from the 3rd quarter. This means that there is intrinsic evolution of the unit price of phones which could tell the executive management a very important story. The next plot is an attempt to clarify this trend.
As we can see in figure 6, the average unit cost (total sales divided by total phones in a given month) is rising while the actual phone numbers sold is dropping. This would be the start of a deeper analysis to answer further questions that the executive management would invariably want to find. Which phone category is contributing most strongly to the rise in unit cost? Which phone category is most affecting the drop in turnover? The answers to these questions would allow the executive management to reevaluate the local marketing strategy for this shop as well as determine which phone categories the shop should promote. This kind of analysis is used to fine tune the sales and better perform in a competitive market.
Business Intelligence is about preparing your growth with analysis of your surroundings. Surveying your markets, your clients, your resources, and uncovering trends so as to be better prepared for the future.
Previous article: How Much? – BI General Question (Part 3)
Next article: How Likely & How Quickly – BI General Question (Part 5)
This is the fourth part of a general introduction to BI using a case study of a mobile phone store.
How Often? – this relates to trends that appear in the analysis of the previous two question groups. These are also measured over fixed periods of time (e.g. monthly, yearly or even since the start of the enterprise). How often does a certain trend/event occur? For example, looking at the high street phone shop, we can take a look at the stock values of the shop at the end of each month, along with the new stock delivery in the first week of each month:
This is where data representation becomes more difficult. The tables 3.1 and 3.2 show the records of stock of phones for the shop for the entire year. It is very difficult to see any trend whatsoever from the tabular report. Plotting this data would not make much sense either that it would not tell us much. However, let us assume that the executive management has decided that it’s high street shop should aim to maintain at least 25% of monthly phones sales as stock at all times. This would represent 1/4 of monthly sales, or more importantly 1 week’s sale figures, sufficient buffer time to order a new stock. Let us also assume that at the end of the month, a shop should not have in excess of 50% of sales numbers as stock. This is a precaution in order to ensure that not too much liquid assets are locked into stock. Now, if we analyse the stock figures as a percentage of monthly sales, we can see how the shop is doing relative to the directives of the executive management. This is shown in figure 7.
How often does the stock fall below the prescribed limit and potentially exposing the shop to a potential loss of sales opportunities? How often is the shop overstocked and therefore exposing itself to a cash flow problem? These questions are clearly answered in the representation of the data in figure 3.1. Here, a new type of analysis would be needed to create a stock KPI (Knowledge Performance Indicator) that would enable the shop to better control it’s stock. Such a KPI could be built from 12 month rolling sales data allowing for a better analysis of current trends. This analysis could also be turned into a stock order tool.
Previous article: How Often? – BI General Question (Part 4)
This is the final part of a general introduction to BI using a case study of a mobile phone store.
How Likely? – This group of questions require more complex analysis of past data to identify trends in order to predict future events. This can be done on sales figures, market share, or various other parameters that affects the company’s performance. This analysis can be presented in a simple tabular form, or even as a graph that reveals the past present and future trend. There is a fine correlation between the accuracy of the data and the prediction, however, a good analysis requires a weighted contribution of historical data. The older the data the lower its contribution to the prediction of a future event and inversely, recent trends and events have a larger impact on the prediction model. This kind of analysis is called predictive modelling.
How Quickly? – This kind of question is again based on more complex analysis and is also related to the ‘How likely’ scenarios from the previous question. Finding out how quickly certain milestones are going to be reached requires an underlying understanding of the likely consumption rates of the particular units being analysed. The answers to these questions impact the financial models of a company and are used to plan the investment opportunities that can be reached out for. This analysis also falls under the predictive modelling category.
There are many predictive analysis tools and methods used for this kind work, however, these methods are only as good as the understanding of the analyst in what they are doing. There are many predictive tools, even in a simple excel sheet the modelling equations offered on a graphical representation of a set of data is quite advanced, however if these are used with little understanding of the statistical errors built into them, then the understanding of their graphical representation can be very flawed, leaving the door open to a lot of misinterpretations. Hence one should always proceed with caution when studying the statistical results of an analysis.