Presented at #sqlpass summit 2015.
In this post, I’ll list few examples from various industries to help you differentiate between business intelligence and data science problems.
Sometime back, I blogged about “Business Analytics Continuum” and in the post we saw that Every Organization has DATA but they use their business data at different levels because of their maturity level. Excel (or other transactional reporting tools) is usually the starting point for any organization – it helps them see WHAT happened. They advance to the next stage, where they get capabilities to slice and dice their data – To find out WHY – and usually this capability is delivered using Business Intelligence tools & techniques. Once the data culture spreads – Thanks to a successful Business Intelligence project – then they soon start to outgrow their business intelligence capabilities by asking problems that need predictive capabilities. This is advanced analytics and Data Science stage. To that end, here are 5 examples to help you differentiate between business intelligence and data science problems:
|Business Intelligence.(WHAT & WHY)||Data Science & advanced analytics.|
||Can you predict bike rentals on an hourly basis?|
||Can you predict the credit risk of the customer during contract negotiations stage?|
|Customer relationship management||
||Can you predict customer churn?|
||Can you predict whether a scheduled flight will be delayed by more than 15 minutes?|
||Can you classify a customer feedback comment into “positive”, “negative” or “neutral”?|
I hope this helps!
Insider’s Introduction to Microsoft Azure Machine Learning (AzureML)
Thu, Sep 18, 2014 12:00 PM – 1:00 PM EDT
Microsoft has introduced a new technology for developing analytics applications in the cloud. The presenter has an insider’s perspective, having actively provided feedback to the Microsoft team which has been developing this technology over the past 2 years. This session will 1) provide an introduction to the Azure technology including licensing, 2) provide demos of using R version 3 with AzureML, and 3) provide best practices for developing applications with Azure Machine Learning.
Mark is a consultant who provides enterprise data science analytics advice and solutions. He uses Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, SAS, SPSS, R, and Hadoop (among other tools). He works with Microsoft Business Intelligence (SSAS, SSIS, SSRS, SharePoint, Power BI, .NET). He is a SQL Server MVP and has a research doctorate (PhD) from Georgia Tech.
Hope to see you there!
Business Analytics Virtual Chapter’s Co-Leader
Classification algorithms are commonly used to build predictive models. Here’s what they do (simplified!):
Now, here’s the difference between Multi Class and Two Class:
if your Test Data needs to be classified into two classes then you use a two-class classification model.
1. Is it going to Rain today? YES or NO
2. Will the buyer renew his soon-to-expire subscription? YES or NO
3. What is the sentiment of this text? Positive OR Negative
As you can see from above examples the test data needs to be classified in two classes.
Now, look at example #3 – What is the sentiment of the text? What if you also want an additional class called “neutral” – so now there are three classes and we’ll need to use a multi-class classification model. So, If your test data needs to be classified into more than two classes then you use a multi-class classification model.
1. Sentiment analysis of customer reviews? Positive, Negative, Neutral
2. What is the weather prediction for today? Sunny, Cloudy, Rainy, Snow
I hope the examples helped, so next time you have to choose between multi class and two class classification models, ask yourself – does the problem ask you to predict two classes or more? based on that, you’ll need to pick your model.
Example: Azure Machine Learning (AzureML) studio’s classifier list:
I hope this helps!
Knowing why we do what we do is important. Stephen Few lists four reason for creating Tables & Graphs in his book “Show me the number”. I really liked them so I am posting it here for your reference:
- it helps us communicate. It helps present information to others.
- it helps us analyze data. it helps us find the insights in the data.
- It helps us Monitor Performance. It helps us keep track information about performance e.g. Sales Performance, Speed of Manufacturing, etc.
- It helps us Plan. It helps us predict and prepare for the future.
Data Explorer let’s you “Explore” (search) for web-based public data. This is a great way to combine data that you may have in your data-sources with public data sources for data analysis purposes. Sometimes your data might not tell you the reason behind the observed trends, so when that happens – you can try to see if a public data-set might give you the much-needed context. Let me give you an Example before we start hands-on w/ data explorer so that you have better understanding of importance of public datasets. Here’s a sample that I found here. So, Here’s a demo:
An auto company is seeing sales trends of Hybrid cars and SUV’s from the sales data-sources. But what is the reason behind that? company data does not show that. Someone hypothesizes that it might be because of gas prices. So they test out the hypothesis by combining gas prices information available via public data. And turns out gas prices might be the driving force of sales trends! SEE:
if the gas prices increase, then the sale of SUV go down and the sale of Hybrids go up:
You know that public data can be helpful! So how can you search for public data-sets? Well, You can manually search online, ask someone, browse through public data repositories like azure data market (and other data markets), there’s also a public data search engine! OR you can directly search for them via Data Explorer.
Here are the steps:
1) Excel 2010/2013 > Data Explorer Tab > Online Search > type “Tallest Buildings”
3) Now let’s do some filtering and shaping. Here are the requirements:
– Hide columns: Image, notes & key
– clean columns that has heights data
– Show only city name in location
OK, let’s get to this one by one!
4) Hiding Columns:
Click on Filter & Shape button from the Query Settings:
Select Image Column > Right Click > Hide:
Repeat the steps for notes & key column.
Click on DONE
5) clean column that has heights data.
Click on Filter & Shape to open the query editor
A) let’s rename it. Select column > Right Click > Rename to Height > press ENTER
B) let’s remove the values in brackets. Select Column > right click > split column > By delimiter > At each occurrence of the delimiter > Custom and enter “(” > OK
Hide height.2 and rename the height.1 to height
Click on DONE
6) Let’s just have city names in the location column
click on Filter & shape to load query editor:
A) select location > right click > split column > by delimiter > Custom – Enter: ° in the text box like this:
Hide Location.2, Location.3, Location.4 & Location.5
Select Location.1 > Right Click > Split Column > by Number of characters > Number of characters: 2 > Once, as far right as possible > OK
Hide Location.1.2 and rename Location.1.1 to Location
One last thing! making sure that the data type of height is numbers.
Select height > change type > number
Select floors > change type > number
click on DONE. Here’s our filtered and shaped data!
7) LET”S VISUALIZE IT!
For the purpose of visualization I copied first 20 rows to a separate excel sheet and created a chart:
That’s about it for this post. Here are some related Posts on Data Explorer:
Unpivoting data using the data explorer preview for Excel 2010/2013
Merging/Joining datasets in Excel using Data Explorer add-in
Remove Duplicates in Excel Tables using Data Explorer Add-in
Web Scraping Tables using Excel add-in Data Explorer preview:
Your comments are very welcome!
One of the key thing I’ve learned is importance of differentiating the concepts of “Data Reporting” and “Data Analysis”. So, let’s first see them visually:
Here’s the logic for putting Data Reporting INSIDE Data Analysis: if you need to do “analysis” then you need reports. But you do not have to necessarily do data analysis if you want to do data reporting.
From a process standpoint, Here’s how you can visualize Data Reporting and Data Analysis:
Let’s thing about this for a moment: Why do we need “analysis”?
We need it because TOOLS are really great at generating data reports. But it requires a HUMAN BRAIN to translate those “data points/reports” into “business insights”. This process of seeing the data points and translating them into business insights is core of what is Data Analysis. Here’s how it looks visually:
Note after performing data analysis, we have information like Trends and Insights, Action items or Recommendations, Estimated impact on business that creates business value.
Data Reporting ≠ Data Analysis
— Paras Doshi (@paras_doshi) January 30, 2013
And now, the 12 one hour sessions ranging from data visualization, predictive analytics to Big Data are online for you to watch! They also serve as “Trailer” for what you can expect at the PASS Business Analytics conference!
Here’s the URL: http://passbaconference.com/Sessions/SneakPeeks.aspx
— 24hop (@pass24hop) February 8, 2013
And I was following some of these sessions live on the event day – and I can tell you, these sessions are great resources!
Also, I participated in the twitter contest (by Microsoft BI) that was happening along w/ the event – and this is what I got for my win!
That’s about it for this post. Enjoy the recordings!
I like exploring data sets to find interesting patterns from them. To that end, I was exploring a data-set: List of companies by revenue and I added a column to calculate Revenue/Employee to explore the dataset:
And I found an outlier!
Here’s the outlier: Exor
Here’s what it’s interesting:
It’s revenue in 2012 is: 109.15 billion USD
And number of employees is just 40!
Just think of Revenue/Employee !
To put things in perspective, Lets Compare that with its neighbor in the data-set:
Rank | Company | Industry | Revenue in USD billion | Employees
I got to know about this by quickly creating a data visualization to explore the data-set:
And removing Trafigura, Vitol and Exor, this is what we have:
Observation: oil and gas industry have relatively higher revenue/employee ration.
That’s about it for this post. Thanks for reading about my data exploration!
- Playing w/ the Occupational employment Statistics Data-Set: (parasdoshi.com)
SEMMA is an acronym introduced by SAS which stands for:
Sample, Explore, Modify, Model and Assess.
I had recently posted about the Data Mining & Knowledge Discovery Process which had following sequential steps:
Raw Data => cleaning => sampling => Modeling => Testing
SEMMA follows the similar sequential steps as we had seen in the data mining process. So while Data Mining process is applicable to any data mining tool out their, SEMMA helps when you use SAS enterprise miner. In fact, it has helped me quickly find the data mining functions available in SAS tool: