Back to basics: Data Mining and Knowledge Discovery Process


Once in a while I go back to basics to revisit some of the fundamental technology concepts that I’ve learned over the past few years. Today, I want to revisit the Data Mining and Knowledge Discovery Process:

Here are the steps:

1) Raw Data

2) Data Preprocessing (cleaning, sampling, transformation, integration, etc.)

3) Modeling (Building a Data Mining Model)

4) Testing the Model (a.k.a. assessing the Model)

5) Knowledge Discovery

Here is the visualization:

knowledge discovery process data mining

Additional Note:

In the world of Data Mining and Knowledge Discovery, we’re looking for a specific type of intelligence in the data: patterns. This is important because patterns tend to repeat, so if we find patterns in our data, we can predict or forecast that similar things are likely to happen in the future.
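To make the five steps a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records are invented, and the "model" is just a majority vote per bucket, not a real data mining algorithm. It only shows how raw data flows through preprocessing, modeling, and assessment before we trust the pattern.

```python
from collections import Counter

# Step 1: hypothetical raw records (age_group, commute_km, bought_bike).
# None marks a missing value. All values are invented for illustration.
RAW = [
    ("young", 3, "yes"), ("young", 4, "yes"), ("young", None, "yes"),
    ("older", 12, "no"), ("older", 15, "no"), ("young", 2, "yes"),
    ("older", 11, "no"), ("young", 14, "no"), ("older", 3, "yes"),
    ("young", 5, "yes"),
]

def preprocess(rows):
    """Step 2: cleaning (drop incomplete rows) and transformation (bucket the distance)."""
    cleaned = [r for r in rows if None not in r]
    return [(age, "short" if km < 10 else "long", label) for age, km, label in cleaned]

def build_model(rows):
    """Step 3: a deliberately simple model, the majority label per commute bucket."""
    votes = {}
    for _, commute, label in rows:
        votes.setdefault(commute, Counter())[label] += 1
    return {commute: counts.most_common(1)[0][0] for commute, counts in votes.items()}

def assess(model, rows):
    """Step 4: accuracy of the model on rows it has not seen."""
    hits = sum(model.get(commute) == label for _, commute, label in rows)
    return hits / len(rows)

data = preprocess(RAW)            # 9 rows remain after dropping the incomplete one
train, test = data[:6], data[6:]  # simple holdout split
model = build_model(train)
accuracy = assess(model, test)

# Step 5: the discovered pattern, plus how well it held up on unseen rows.
print(model, accuracy)
```

The holdout split is the important design choice here: the pattern is only called "knowledge" after it has been checked against data that was not used to find it.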


In this blog post, we saw the Knowledge Discovery and Data Mining process.


Three V’s of Big Data with Example:


In this blog post, we will look at the Three V’s of Big Data, with examples:

1. Volume:

TBs, PBs, and ZBs of data get created:

From the webinar “How to Walk The Path from BI to Data Science: An interview with Michael Driscoll, data scientist and CEO of Metamarkets” – A global surge in Data

2. Velocity:

The speed at which information flows.

Example: 50 Million tweets per day!

twitter 50 million tweets per day

(This is back in Nov. of 2010 – the number must have increased!)

UPDATE 23 Nov 2012: Wikipedia says 340 million tweets per day!

twitter 2012 340 million tweets per day

3. Variety:

All types of data are now being captured, whether structured or not.

Example: Text from PDFs, emails, social network updates, voice calls, web traffic logs, sensor data, click streams, etc.

data variety big data

Image courtesy

And this may be followed by other V’s like V for Value.
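To get a feel for the Velocity numbers above, it helps to convert the daily totals into a per-second rate. The sketch below does only that arithmetic on the two figures quoted in this post (50 million and 340 million tweets per day); the function name is my own.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def per_second(events_per_day):
    """Average event rate per second for a given daily total."""
    return events_per_day / SECONDS_PER_DAY

rate_2010 = per_second(50_000_000)   # figure quoted for Nov 2010
rate_2012 = per_second(340_000_000)  # figure quoted for Nov 2012
growth = rate_2012 / rate_2010

print(f"2010: ~{rate_2010:.0f} tweets/second")
print(f"2012: ~{rate_2012:.0f} tweets/second")
print(f"growth: {growth:.1f}x")
```

These are averages, so peak rates during big events would be higher.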


In this blog-post, we saw Three V’s of Big Data with Example.


Book Review: The Data Journalism Handbook


Data Journalism Book Cover

In this post, I am going to review The Data Journalism Handbook.

Earlier, I shared an insight from the book with you. Here it is: the world has changed from “what’s new” to “what does it all mean”. This means that professionals who focus only on reporting “what’s new” could soon be out of a job. They should start equipping themselves with analytics skills that help them uncover insights from the news and help all of us make sense of the information around us.

To that end, the book is a great inspiration for journalists, and it seems meant to encourage them to embrace the change. It inspires journalists to think of stories and find data about them. So what’s in it for Data Geeks? It encourages Data Geeks to help journalists weave stories around the data they have found. The book also outlines resources that Data Geeks could use.

Now, two things I really liked about the book:

1. Examples & case studies. Lots of them! Very inspiring!

2. I came to know about tools that I didn’t know about before. I am going to use them!

You can read the book online (web version) for free here:

Five examples of Recommendation Systems on the web:


Recommendation systems are an application of Data Mining technologies. I have researched how to implement a recommendation system, and as part of that research I studied recommendation systems that are already out there on the Internet. Here are five examples of recommendation systems on the web:

1. Amazon

Customers Who Bought This Item Also Bought:

recommendation systems amazon customers who bought this also bought

Frequently Bought Together: (Example of Market Basket Analysis a.k.a Association Rules):

recommendation systems amazon frequently bought together
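To show what Market Basket Analysis computes, here is a small sketch that derives association rules for item pairs, using support (how often the pair appears) and confidence (how often the second item appears given the first). The baskets are invented, and this is a toy illustration under my own assumptions, not how Amazon actually builds this feature.

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"camera", "memory card", "case"},
    {"camera", "memory card"},
    {"camera", "tripod"},
    {"memory card", "case"},
    {"camera", "memory card", "tripod"},
]

def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties broken alphabetically for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

rules = pair_rules(BASKETS)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, "case -> memory card" comes out with confidence 1.0, because every basket that contains a case also contains a memory card.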

2. LinkedIn

You should read this: How does LinkedIn’s recommendation system work? – it would open up your brain to “recommendation” opportunities around you!

Jobs you may like + Groups you may like + Companies you may follow:

recommendation systems Linkedin Groups Jobs Companies

3. Netflix

Did you know about the Netflix Prize for improving their recommendation engine? If not, you should read about it!

Here’s their Movies you’ll love recommendation system:

netflix prize recommendation system

4. Twitter

People you may want to follow:

twitter who to follow recommendations data mining

5. Google

I do not have a screenshot, but I wanted to point out that Google personalizes search results (that is, recommends based on past behavior) using your search history. And you can switch that off, if you want: Turn off search history personalization


In this blog post, we saw examples of recommendation systems. The key takeaway is that there is more than one approach to building a recommendation system. The approaches can be based on:

1. Past behavior

2. Past behavior of “friends”

3. The item that is being searched

And you can definitely mix and match!
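Two of these approaches can be sketched in a few lines. The first function is item-based ("customers who bought this also bought"), and the second scores items using the behavior of similar users, which stands in for "friends" here. The purchase data is invented, and using Jaccard similarity is my own choice for the sketch, not something any of the sites above have confirmed.

```python
from collections import Counter

# Hypothetical purchase histories, invented for illustration.
PURCHASES = {
    "ann":   {"book", "lamp", "desk"},
    "bob":   {"book", "lamp"},
    "carol": {"book", "chair"},
    "dave":  {"lamp", "desk"},
}

def also_bought(item, purchases, top_n=2):
    """Item-based: items that most often co-occur with `item` across all users."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            counts.update(basket - {item})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

def from_similar_users(user, purchases, top_n=2):
    """User-based: score unseen items by the Jaccard similarity of the users who own them."""
    mine = purchases[user]
    scores = Counter()
    for other, theirs in purchases.items():
        if other == user:
            continue
        similarity = len(mine & theirs) / len(mine | theirs)
        for item in theirs - mine:
            scores[item] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, score in ranked[:top_n] if score > 0]

print(also_bought("book", PURCHASES))
print(from_similar_users("bob", PURCHASES))
```

Mixing and matching could be as simple as combining the two score lists with weights, though how to weight them is a tuning decision.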

I hope this post helped you understand an application of data mining that’s all around us! And a question: where else do you see recommendation systems in action? Leave a comment!

Things I shared on Social Media Networks during Oct 19 – Nov 11


The goal of this series is to recap the conversations that I’m having on social networks, because I do not want my blog readers to miss them. So here is the recap of the last three weeks:

1. I was at SQL PASS 2012!

SQL PASS 2012 Paras Doshi

2. A nice Dashboard!

Metro fied Business Intelligene Dashboard windows 8

3. Learn to build an Enterprise Information management system using SSIS, DQS and MDS:

 Enterprise Information management system using SSIS, DQS and MDS

4. Fake Data!

5. I reached 2000 points on MSDN!

Paras Doshi reached 2000 points on MSDN!

6. A nice video by Jeremy Howard on Predictive Analytics:

7. A nice data visualization via the Data Mining add-in for Excel

nice data visualization via the Data Mining add-in excel

8. Get started with Hadoop on Windows 7/Server!

Download here:

Demo Here:

Hadoop on windows 7/server!

9. I was at Give Camp 2012! If you do not know about “Give Camp”, then you should check it out!

Here’s last year’s (2011) post:

Give Camp 2012


Data Mining Demo for Marketing vertical: How to create a Targeted mailing list?


Tools I’ll be using for the Demo:

Excel 2010

SQL Server 2012 (specifically SQL Server Analysis Services)

Data Mining Add-in for Excel.

Sample data-set that comes with the excel add-in


The Marketing department needs to create a targeted mailing list.

What data do we need?

To create a Targeted mailing list – we’ll need a historical data-set of customer purchase history

What will we do with the data?

Based on the historical data-set, we’ll be able to find “patterns” in past consumer behavior. E.g. a single male going to college and living in Europe is likely to buy a bike. Then, using these patterns, we can classify NEW customers.

Technically, we’ll be doing classification using the Microsoft Decision Trees algorithm.

(Read the difference between classification and clustering)
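To give a rough idea of what a decision tree classifier does under the hood, here is a tiny ID3-style sketch in Python. To be clear, this is not the Microsoft Decision Trees algorithm that the add-in uses. It is a simplified stand-in, and the customer records and attribute names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical customer history: attributes plus whether they bought a bike.
HISTORY = [
    ({"marital": "single",  "region": "Europe",  "student": "yes"}, "yes"),
    ({"marital": "single",  "region": "Europe",  "student": "no"},  "yes"),
    ({"marital": "married", "region": "Europe",  "student": "no"},  "no"),
    ({"marital": "married", "region": "Pacific", "student": "no"},  "no"),
    ({"marital": "single",  "region": "Pacific", "student": "yes"}, "yes"),
    ({"marital": "married", "region": "Pacific", "student": "yes"}, "no"),
]

def entropy(rows):
    """Shannon entropy of the labels: 0 means the group is pure."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def split(rows, attr):
    """Group rows by the value they have for `attr`."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append((features, label))
    return groups

def build_tree(rows, attrs):
    labels = Counter(label for _, label in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not attrs:
        return majority  # leaf: nothing left to separate
    def remainder(attr):
        # Weighted entropy left over after splitting on `attr` (lower is better).
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, attr).values())
    best = min(attrs, key=remainder)
    rest = [a for a in attrs if a != best]
    return {
        "attr": best,
        "default": majority,
        "branches": {v: build_tree(g, rest) for v, g in split(rows, best).items()},
    }

def classify(tree, features):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(features[tree["attr"]], tree["default"])
    return tree

tree = build_tree(HISTORY, ["marital", "region", "student"])
new_customer = {"marital": "single", "region": "Europe", "student": "yes"}
print(tree["attr"], classify(tree, new_customer))
```

In this made-up data the tree picks marital status as its first split because it separates buyers from non-buyers most cleanly. Real data is rarely that tidy.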

Let’s get into action!

STEP 1: Build a Model

Data Mining Tab > click on classify:

data mining in excel example customer classification for maketing maling list 0

Follow the steps:

data mining in excel example customer classification for maketing maling list 1

Select the data:

data mining in excel example customer classification for maketing maling list 2

In this case, since we want to predict the likelihood of buying a bike – our column to analyze is BikeBuyer


data mining in excel example customer classification for maketing maling list 0 3

For the demo, I am going to leave the defaults as they are. There are “optimization” steps that you can do, but for the demo I am going to keep it super simple.

data mining in excel example customer classification for maketing maling list 4

Name the model:

data mining in excel example customer classification for maketing maling list 5

The Model has been created!

data mining in excel example customer classification for maketing maling list 6

STEP 2: Query the MODEL to predict the likelihood of bike purchase of a new customer

data mining in excel example customer classification for maketing maling list 7

Select the model:

data mining in excel example customer classification for maketing maling list 8

Select the data:

data mining in excel example customer classification for maketing maling list 9

Specify the columns that would be used in predicting the likelihood:

data mining in excel example customer classification for maketing maling list 10

Add the column that will have the “predicted value”


data mining in excel example customer classification for maketing maling list 11

An example of Data Mining Extensions (DMX):

data mining in excel example customer classification for maketing maling list 12

For the demo, I am just going to add the column to the existing table:

data mining in excel example customer classification for maketing maling list 13

Yay! Here’s our Targeted Mailing list – see the last column:

Screenshot 1

data mining in excel example customer classification for maketing maling list 14

Screenshot 2:

data mining in excel example customer classification for maketing maling list 15

Now what?

Marketers can now send “coupons” to ONLY those people who are most likely to buy a bike! And so that’s how you create a targeted mailing list using the Excel Data Mining add-in.
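If you export the scored table, picking the recipients is just a filter on the prediction. The sketch below assumes each customer comes with a predicted probability of buying a bike; the names, probabilities, and the 0.5 threshold are all made up for illustration.

```python
# Hypothetical scored customers: (name, predicted probability of buying a bike).
SCORED = [
    ("Customer A", 0.82),
    ("Customer B", 0.35),
    ("Customer C", 0.67),
    ("Customer D", 0.12),
]

def mailing_list(scored, threshold=0.5):
    """Keep customers at or above the threshold, most likely buyers first."""
    chosen = [(name, p) for name, p in scored if p >= threshold]
    return sorted(chosen, key=lambda pair: pair[1], reverse=True)

targets = mailing_list(SCORED)
for name, probability in targets:
    print(name, probability)
```

Raising the threshold shrinks the list and saves coupons. Lowering it reaches more people but with less certainty.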

How to Solve: Excel Data Mining add-in disappeared.



In this blog-post, we’ll see what you can do when the Excel data mining add-in disappears.


1. What happened?

So I have installed the Excel Data Mining add-in. 

sql server 2012 data mining excel addin

But I do not see the Data Mining Tab in Excel:

excel sql server data mining tab missing

2) So Now what?

I searched and found this, and got it working for the software versions (Excel 2010, SQL Server 2012) that I had, so I am documenting it here.

3) Log in as Administrator > Office button > Options > Add-Ins. Do you see the Data Mining add-in listed as disabled?

sql server 2012 data mining excel addin disabled excel options

4) Select Disabled Items in the Manage drop-down > click Go

excel options enable a disabled item data mining

5) Click on the Data Mining add-in and enable it > click Close > click OK

6) Re-open Excel. Can you see it now? Yes? Yay!

excel sql server data mining tab enabled yay

That’s about it for this post.


In this blog-post, we saw how to enable the data mining excel add-in.

Machine Learning VS. Data Mining


For the past couple of months, one of the things that I have thought about is “What is the difference between Machine Learning & Data Mining?” I have studied Data Mining and Advanced Data Mining concepts at both undergraduate and graduate level, and recently I started learning about Machine Learning via  – I was curious to know the difference between the two similar, inter-related fields. After spending time understanding what Machine Learning is, here’s what I am thinking:

When I learned Data Mining, the focus was on taking a data-set and using one or more algorithms to detect patterns in it. Now I am studying Machine Learning, where we’re asked to write algorithms (and build models). So to me, Data Mining seems to deal with the practical aspects of putting Machine Learning algorithms to use.

When I took Data Mining courses, I didn’t write algorithms. Instead, I learned what different Data Mining algorithms can do and what kind of patterns each algorithm helps us find. In the Machine Learning class, my focus is on learning how to write the algorithms (build the models) and optimize them so that they predict well.

Also, in Machine Learning the goal is clear: the questions are mostly like “Build a model from past data that predicts X”. In contrast, I remember that for our graduate-level class, my professor gave our team a data-set of “fatal accident data” and said “Go play with it!”

These were my experiences. What are your experiences with Data Mining and Machine Learning, and how do you differentiate between these two fields, which are similar in more than one way?

Excel data Mining in Action: Forecasting Twitter Followers for next week


OK, so you know I recently installed the Data Mining Excel add-in: How to enable Data Mining in EXCEL powered by SQL Server Analysis Services? I couldn’t wait to go beyond the samples provided with the add-in, so I decided to start with forecasting. For this blog post, I downloaded my Twitter stats into Excel. Of course, I had to clean the data and add computations, which was equally exciting, and I ended up with a data-set that had the follower count and the number of tweets I had posted.

The date range in the data-set is 23 July 2012 to 5 Sep. 2012. Of course, to get a “better” forecast, you need to feed in more historical data. In my case, the Twitter API didn’t allow me to pull ALL historical data in one go; let’s not get into details because that’s not the focus of this blog post. But the rule of thumb is that more historical data gives a better forecast. Here are the steps I followed:

1. Loaded data into Excel 2010. (I am using Twitter as an example here. Another real-world scenario would be a sales forecast.) Note that I have kept it simple for the purpose of the demo.

2. Now, let’s create a forecast model.

Go to Data Mining Tab > Data Modeling > Forecast:

data mining excel forecast twitter followers

3) Forecast Wizard:

a. Getting Started with Forecast Wizard: NEXT

b. Select Source Data. Then Press NEXT

c. Select input columns. In this case, I selected Date as Time Stamp and Total Follower Count & Total Tweet Count as Input columns.

– Notice the Parameters button? It is used to configure how the (Time Series) algorithm runs. For the purpose of this demo, I am not going to explore that.

d. Finish.

4) It forecasted (using the Time Series Data Mining Algorithm) the follower count for next week. As you can see, it says that on 12th Sep 2012, I would have 438 followers, which is +3 compared to today’s (5th Sep) follower count.

forecast twitter followers using excel data mining

5) Few Notes

a. I had selected Total Tweet Count just to show that it can forecast more than one variable at the same time. Here the model used the Date column as the time-stamp while forecasting.

b. Of course, this may not happen for REAL because your follower count can go up or down based on:

  • Tweet (quality tweets!) frequency
  • Number-of-bots-that-decide-to-follow-you (kidding!)
  • Re-tweeting interesting content and replying to your followers. Basically being social!
  • If a tweet gets picked up by someone famous, your count increases
  • Other real life “surprises”..

Here’s the point though: This was just a Toy Example to show “forecasting” with Excel Data Mining – If I explore it further, I would document my experiences!
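If you want to see the basic idea of forecasting without Excel, here is a minimal sketch that fits a straight-line trend with ordinary least squares and extends it a week ahead. This is a much simpler technique than the Time Series algorithm the add-in uses, and the follower counts below are invented, not my real data.

```python
def fit_trend(values):
    """Ordinary least squares fit of value = intercept + slope * day_index."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope

def forecast(values, days_ahead):
    """Extend the fitted line for the next `days_ahead` days."""
    intercept, slope = fit_trend(values)
    last = len(values) - 1
    return [intercept + slope * (last + d) for d in range(1, days_ahead + 1)]

# Hypothetical daily follower counts (invented for illustration).
followers = [420, 421, 423, 424, 426, 427, 429]
intercept, slope = fit_trend(followers)
week = forecast(followers, 7)

print(f"trend: {slope:.2f} followers/day")
print([round(v) for v in week])
```

A straight line cannot capture weekly cycles or sudden jumps, which is one more reason to treat a forecast like this as a toy.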

And oh, BTW here’s a nice video by @MarkTabNet and @SolidQ (SolidQ: I work at this amazing company!) on “Microsoft Data Mining Demo — Forecasting (SQL Server 2008 and Excel 2007”. And MarkTabNet is a great resource for Data Miners, Check it out!

[video] Data Science is not NEW – it’s just that we live in a VERY special time!

  • Data Analysis is NOT new
  • Data Mining is NOT new
  • Predictive Analytics is NOT new
  • Machine Learning is NOT new
  • Statistics is NOT new
  • And Data Science is NOT new

So what’s new?

  • The rate at which data is produced.
  • The variety in Data that’s being produced.
  • The “amount” of data that’s being produced.

And we did not have Tools and Techniques before – But now we do! Indeed, We live in a VERY special time!

Here’s a nice 5 minute video titled “Data Science: Beyond Intuition”.

Link to video:  And thanks to Ryan Swanstrom for sharing!