Presented at #sqlpass summit 2015.
There are many techniques to analyze data. In this post, we’re going to talk about two that are critical for good data analysis: “Benchmarking” and “Segmentation”. Let’s talk a bit more about them:
Benchmarking means that when you analyze your numbers, you compare them against some point of reference. This quickly adds context to your analysis and helps you assess whether a number is good or bad. This is super important! It adds meaning to your data!
Let’s look at an example. The CEO wants to see revenue numbers for 2014, and an analyst is tasked with creating this report. If you were the analyst, which report do you think would resonate more with the CEO: the left one or the right one?
I hope the above example helped you understand the importance of providing context with your data.
Now, let’s briefly talk about where you get the data for a benchmark.
There are two main sources: 1) Internal & 2) External
The example that you saw above was using an Internal source as a benchmark.
An example of an external benchmark is subscribing to industry news or data so that you understand how your business is doing compared to similar businesses. If your business sees a huge spike in sales, you need to know whether it’s just your business or an industry-wide phenomenon. For instance, most e-commerce sites see a spike in sales in Q4; they can understand what’s driving it only by looking at industry data and realizing that it’s shopping season!
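As a rough illustration of comparing a number against an internal and an external point of reference, here is a minimal Python sketch. The function name `compare_to_benchmark` and all figures are hypothetical, made up for this example and not taken from the report discussed above.

```python
def compare_to_benchmark(value, benchmark):
    """Return (absolute difference, percentage difference) of value vs. benchmark."""
    if benchmark == 0:
        raise ValueError("benchmark must be non-zero to compute a percentage")
    diff = value - benchmark
    return diff, diff * 100 / benchmark


# Hypothetical figures in millions of dollars (not from the post).
revenue_2014 = 12.0
revenue_2013 = 10.0            # internal benchmark: last year's revenue
industry_growth_pct = 25.0     # external benchmark: assumed industry-wide growth

diff, pct = compare_to_benchmark(revenue_2014, revenue_2013)
gap_vs_industry = pct - industry_growth_pct

print(f"2014 revenue: ${revenue_2014:.1f}M ({pct:+.1f}% vs. 2013)")
print(f"Growth relative to the industry: {gap_vs_industry:+.1f} points")
```

With these made-up numbers, growth looks healthy against last year but trails the assumed industry figure, which is exactly the kind of context a bare revenue number cannot give.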
Now, let’s shift gears and talk about technique #2: Segmentation.
Segmentation means that you break your data into categories (a.k.a. segments) for analysis. So why would you want to do that? Looking at data at an aggregated level is certainly helpful and helps you figure out the direction for your analysis, but the real magic and the most powerful insights usually come from analyzing the segments (or subsets of data).
Let’s look at an example.
Let’s say the CEO of a company looks at profitability numbers. He sees $6.5M, which is $1M greater than last year’s, so that’s great news, right? But does that mean everything is fine and there’s no scope for optimization? That can only be found out by segmenting the data, so he asks his analyst to look into it. After some experimentation and interviews with business leaders, the analyst finds an interesting insight by segmenting the data by customer and sales channel! Even though the company is profitable overall, there is a huge opportunity to improve profitability for customer segment #1 across all sales channels (especially channel #1, where there’s a $2M+ loss!). Here’s a visual:
I hope that helps to show that segmentation is a very important technique in data analysis!
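To make the idea concrete, here is a small Python sketch of segmenting a profit figure by customer segment and sales channel. The records are hypothetical: they were invented so that the total matches the $6.5M headline and one segment/channel pair shows a $2M+ loss, and they are not the company's actual data. The helper name `profit_by` is also just an assumption for this example.

```python
from collections import defaultdict

# Hypothetical records in thousands of dollars:
# (customer_segment, sales_channel, profit)
records = [
    ("Segment 1", "Channel 1", -2100),
    ("Segment 1", "Channel 2", 300),
    ("Segment 1", "Channel 3", 200),
    ("Segment 2", "Channel 1", 2500),
    ("Segment 2", "Channel 2", 1600),
    ("Segment 2", "Channel 3", 1000),
    ("Segment 3", "Channel 1", 1400),
    ("Segment 3", "Channel 2", 900),
    ("Segment 3", "Channel 3", 700),
]


def profit_by(rows, key_index):
    """Sum profit grouped by the field at key_index (0 = segment, 1 = channel)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


total = sum(profit for _, _, profit in records)   # the aggregated view
by_segment = profit_by(records, 0)                # the segmented view
losses = [row for row in records if row[2] < 0]   # segment/channel pairs losing money

print(f"Total profit: ${total / 1000:.1f}M")
for segment, profit in sorted(by_segment.items()):
    print(f"{segment}: ${profit / 1000:.1f}M")
print("Loss-making pairs:", losses)
```

The aggregated total looks fine on its own; only the grouped view reveals that one customer segment is dragging profitability down.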
In this post, we saw segmentation & benchmark techniques that you can apply in your daily data analysis tasks!
In this post, I’ll list a few examples from various industries to help you differentiate between business intelligence and data science problems.
Some time back, I blogged about the “Business Analytics Continuum”, and in that post we saw that every organization has data, but organizations use their business data at different levels depending on their maturity. Excel (or another transactional reporting tool) is usually the starting point for any organization: it helps them see WHAT happened. They then advance to the next stage, where they gain the ability to slice and dice their data to find out WHY, and this capability is usually delivered using Business Intelligence tools & techniques. Once the data culture spreads, thanks to a successful Business Intelligence project, they soon start to outgrow their business intelligence capabilities by asking questions that need predictive capabilities. This is the advanced analytics and Data Science stage. To that end, here are 5 examples to help you differentiate between business intelligence and data science problems:
|Business Intelligence (WHAT & WHY)||Data Science & advanced analytics|
||Can you predict bike rentals on an hourly basis?|
||Can you predict the credit risk of the customer during contract negotiations stage?|
|Customer relationship management||
||Can you predict customer churn?|
||Can you predict whether a scheduled flight will be delayed by more than 15 minutes?|
||Can you classify a customer feedback comment into “positive”, “negative” or “neutral”?|
I hope this helps!
7 Ideas on Encouraging Advanced Analytics
Thu, Jul 17, 2014 12:00 PM – 1:00 PM EDT
Many companies are starting or expanding their use of data mining and machine learning. This presentation covers seven practical ideas for encouraging advanced analytics in your organization.
Mark Tabladillo is a Microsoft MVP and SAS expert based in Atlanta, GA. His Industrial Engineering doctorate (including applied statistics) is from Georgia Tech. Today, he helps teams become more confident in making actionable business decisions through the use of data mining and analytics. Mark provides training and consulting for companies in the US and around the world. He has spoken at major conferences including Microsoft TechEd, PASS Summit, PASS Business Analytics Conference, Predictive Analytics World, and SAS Global Forum. He tweets @marktabnet and blogs at http://marktab.net.
REGISTER HERE: bit.ly/PASSBAVC071714
Hope to see you there!
Business Analytics Virtual Chapter’s Co-Leader
The work of Dr. Steven Levitt (the Indiana Jones of economics & author of Freakonomics) involves finding insights from data. In the keynote, he shared some of the interesting & fun insights that he has found.
One example from Dr. Levitt: according to the data, it was 7 times more dangerous to sell crack in Chicago than to be in combat in Iraq. https://twitter.com/markvsql/status/322707949158006786
He also talked about other insights, which can also be found in his book Freakonomics. After getting the audience fascinated by what analyzing data can do, he moved on to his real-world experiences of analyzing data for businesses and tied all these fascinating insights back to some tips he had for the audience. Here is a brief recap of the tips he shared:
*Above text is linked to tweets.
That’s about it for this post. What do you think about the tips that Dr Levitt shared?
- Data Analysis is NOT new
- Data Mining is NOT new
- Predictive Analytics is NOT new
- Machine Learning is NOT new
- Statistics is NOT new
- And Data Science is NOT new
So what’s new?
- The rate at which data is produced.
- The variety of data that’s being produced.
- The “amount” of data that’s being produced.
We did not have the tools and techniques before, but now we do! Indeed, we live in a VERY special time!
Here’s a nice 5 minute video titled “Data Science: Beyond Intuition”.
I found a data-set of passwords on DataScienceCentral, “Password and hijacked email dataset for you to test your data science skills”, and for fun, I played with it for an hour or so:
1) Password Length vs Frequency
2) Percentage of passwords having at least one special character vs passwords having no special character:
3) Percentage of passwords that have at least one number, one letter & one special character AND a length of 8 or more.
Let’s see a comparison of passwords of length 8 or more (69.302%) vs passwords of length 8 or more that combine letters, numbers & special characters (1.485%):
That’s about it for now – it was fun!
And for those interested, here are a few behind-the-scenes technical details:
Tools I used:
1. Excel & 2. SQL Server
Note: I first tried using Google Refine to augment the data, but it crashed on me, so I thought of using SQL Server and T-SQL instead. If Excel 2010 supported 2+ million rows, I would not have needed SQL Server. Anyhow, the tool used is not important here.
2 million passwords in a .txt file.
Information I appended to the data-set using TSQL:
1. Length of password
2. Has Alphabets?
3. Has Numbers?
4. Has special Characters?
Plus a few others derived from #2, #3 & #4, like “has alphabets + numbers + special characters?”
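For readers who want to try something similar, here is a minimal sketch of deriving these columns. It is written in Python rather than the T-SQL I actually used, so treat it as an illustration and not the original queries. The function names, the sample passwords, and the definition of a “special character” (anything that is neither a letter nor a digit) are all assumptions made for this example.

```python
def password_features(password):
    """Derive the appended columns for a single password."""
    has_alpha = any(c.isalpha() for c in password)
    has_digit = any(c.isdigit() for c in password)
    # Assumption: "special" means neither a letter nor a digit.
    has_special = any(not (c.isalpha() or c.isdigit()) for c in password)
    return {
        "length": len(password),
        "has_alpha": has_alpha,
        "has_digit": has_digit,
        "has_special": has_special,
        "has_all_three": has_alpha and has_digit and has_special,
    }


def percentage(passwords, predicate):
    """Percentage of passwords whose features satisfy the predicate."""
    if not passwords:
        return 0.0
    hits = sum(1 for p in passwords if predicate(password_features(p)))
    return 100.0 * hits / len(passwords)


# Hypothetical sample; the real data-set has about 2 million rows.
sample = ["password", "123456", "P@ssw0rd!", "qwerty12"]

long_enough = percentage(sample, lambda f: f["length"] >= 8)
long_and_mixed = percentage(
    sample, lambda f: f["length"] >= 8 and f["has_all_three"]
)

print(f"Length 8 or more: {long_enough:.3f}%")
print(f"Length 8 or more with letters, numbers & special characters: {long_and_mixed:.3f}%")
```

On the full file, you would read the passwords line by line instead of using the hard-coded `sample` list.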
That’s about it for the technical details. Ping me if interested!
- Where can we find datasets that we can play with for Business Intelligence, Data Mining, Data Analysis Projects? (parasdoshi.com)
- The top 10 passwords from the Yahoo hack: Is yours one of them? (zdnet.com)
Update 1st August: I found this too: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/
Update 12 Nov 2012: I found this! Link to 400 datasets! http://www.datawrangling.com/some-datasets-available-on-the-web
Update 19 Dec 2012: Lynn Langit has a list here: http://lynnlangit.wordpress.com/public-datasets/
Recently, on the SQL Server Data Mining forum, I answered a question about where to find data-sets for a Business Intelligence project.
Apart from the AdventureWorks and Contoso data-sets, there are places where you can download data-sets to play with for your Business Intelligence, Data Mining or Data Analysis projects.
Here is the list of data-sets that I have collected:
3. Windows Azure Data Market
6. Hilary Mason’s Data-Set Bundle: https://bitly.com/bundles/hmason/1 (Also featured in Quora Link that I shared earlier)
7. And if you can’t find the data-set, ask for it here: http://getthedata.org/
Have I missed anything? Do comment! I’ll add the link with due credit.