Guest Blog: How we use the Fuzzy Lookup add-in in our company to solve data inconsistency problems


This is a Guest Blog from Mantresh Jain.

About Mantresh Jain:

Mantresh Jain is a C-level executive at an SMB in the manufacturing domain, based in India. He has a bachelor’s degree from a business school and a special interest in how businesses can use the newest information technology tools to optimize business processes. He is working on a company-wide ERP implementation and is the single point of contact for the implementation process. He spends his free time on computer games of all kinds! Connect with him here: http://www.linkedin.com/pub/mantresh-jain/43/562/749

How did they discover the Fuzzy Lookup add-in for Excel? (A write-up by Paras)

Some months ago, Mantresh approached me to see if I knew of any tool that would help him deal with “messy” data. After asking a few more questions, I learned that:

–          Messy data = lots of duplicates

–          They use SQL Server Express and do NOT plan to upgrade to a SQL Server edition that includes Data Quality Services and/or Master Data Services. Remember the context here: they are a small and medium-sized business.

–          They do use Excel, a lot!

–          They do not have folks with “SQL” knowledge

With these requirements in mind, I asked him to see whether an add-in for Excel called “Fuzzy Lookup” would meet their needs. After trying it out, here’s Mantresh’s experience of using the Fuzzy Lookup add-in for Excel in their organization:

Summary:

In my company we are implementing ERP software. I faced the problem of migrating data from two FoxPro-based applications to SQL (for the ERP).

More Details:

The two FoxPro applications worked independently from each other, and as a result each of them had its own separate database.

Let’s call them FX1 and FX2.

I wanted to import the Account Master data from them into SQL. Here are the fields in our Account Master data:

Name, Address, Bank Details, Phone Number among other fields

Problem

Both systems had issues with data duplication and data inconsistency.

To give you an example, here are the problems I faced:

1) FX1 had around 3500 entries and FX2 had 2400 entries.

Of the 3500 entries in FX1, around 2000 were the same as entries in FX2.

Likewise, FX2 had around 2000 entries that were the same as entries in FX1.

I wanted to import into SQL only the unique Account Master records gathered by “combining” the two systems.

Example:

FX1 has “VMS Industires” while FX2 has “V.M.S Industries”

Solution

The Fuzzy Lookup add-in for Excel.

Step 1) Import the data from both databases into Excel.

Step 2) Use Fuzzy Lookup to find entries that match each other, based on the matching conditions that we select.

Step 3) It reorganizes the data so that each FX1 entry is listed alongside its matching FX2 entries:

FX1 entry | 1st matching FX2 entry
          | 2nd matching FX2 entry

This is how we find duplicate entries and then clean our data-set.
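For readers who want to see the idea outside Excel, here is a minimal Python sketch of fuzzy name matching. It is only an illustration and not how the Fuzzy Lookup add-in works internally: it uses the standard library's difflib similarity ratio, and the helper names, the 0.85 threshold and every sample row except the VMS pair are assumptions made up for this example.

```python
import difflib
import re


def normalize(name):
    """Lowercase a name, drop punctuation and collapse extra spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def fuzzy_matches(fx1_names, fx2_names, threshold=0.85):
    """For each FX1 name, list the FX2 names whose similarity meets the threshold."""
    results = {}
    for left in fx1_names:
        scored = []
        for right in fx2_names:
            score = difflib.SequenceMatcher(
                None, normalize(left), normalize(right)
            ).ratio()
            if score >= threshold:
                scored.append((right, round(score, 2)))
        # Best match first, mirroring "1st matching entry, 2nd matching entry".
        results[left] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return results


# Hypothetical sample rows; only the VMS pair comes from the post.
fx1 = ["VMS Industires", "Acme Traders"]
fx2 = ["V.M.S Industries", "Zenith Exports"]
matches = fuzzy_matches(fx1, fx2)

for name, candidates in matches.items():
    print(name, "->", candidates)
```

With a threshold like this, the misspelled "VMS Industires" pairs up with "V.M.S Industries", while unrelated names get an empty candidate list. The threshold would need tuning on real data, and the matches would still need a manual review before deleting anything.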

Benefit

If not for Fuzzy Lookup, I would have had to manually match each entry against every other entry, which would have taken an estimated 60 to 100 man-hours. With Fuzzy Lookup, we did the job in only 24 man-hours.

———————-

Conclusion by Paras:

Thanks Mantresh for sharing your experience!

And here’s a related post:
How to clean similar textual data in Excel via Fuzzy lookup add-in?


Excel: How to split the content of one excel cell into separate columns?


I wanted to explore a data-set using Excel. The problem was that when I opened the data-set, all the data was in one column. It was “supposed” to be in different columns, but I found that each row’s values sat in a single Excel cell. This was not Excel’s fault; it was just the way the data-set was defined. Here’s what I mean:

[Screenshot: a data set opened in Excel, before Text to Columns]

Can you see that the TWO values are in ONE column?

Problem? Yes. How do we solve it? Turns out there’s a nice feature called “Text to Columns” that should be of help here. Let’s try that:

1) Excel Toolbar > Data > Data Tools > Text to columns

[Screenshot: Text to Columns under Data Tools in Excel]

2) This should open the “Convert Text to Columns Wizard”

Step 1: I chose Delimited

Step 2: I chose Comma as the delimiter.

Here are other delimiters that you could choose:

[Screenshot: delimiter options: tab, semicolon, comma, space, user-defined]

Step 3: I left the default choices. But you could change the data format if you want. You could also choose the destination cells.

Then I clicked on FINISH.

3) Nice! Here’s what I wanted – And I added a header row.

[Screenshot: the Excel cell values split into separate columns by comma]

And my data exploration:

[Screenshot: data exploration, step one for building a predictive model]

Conclusion:

In this blog-post, we saw how one can split an Excel cell into separate columns at each comma, tab, space, semicolon or user-defined character.
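If you ever need the same split outside Excel, here is a small Python sketch of the idea. It is not what the Text to Columns wizard runs internally. It just uses the standard csv module, and the function name and sample values are made up for illustration.

```python
import csv
import io


def text_to_columns(lines, delimiter=","):
    """Split each single-cell line into a list of column values."""
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter=delimiter)
    return [row for row in reader]


# Hypothetical data: two values stuck together in one cell per row.
single_column = ["5.1,3.5", "4.9,3.0"]
rows = text_to_columns(single_column)

# The same helper works for the other delimiters the wizard offers.
semicolon_rows = text_to_columns(["a;b;c"], delimiter=";")

for row in rows:
    print(row)
```

Using the csv module instead of a plain string split means quoted values that contain the delimiter stay in one column.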

Visualizing dataset of 2 million+ passwords:


I found a data-set of passwords on DataScienceCentral: Password and hijacked email dataset for you to test your data science skills. For fun, I played with the data-set for an hour or so:

1) Password Length vs Frequency

[Chart: password length vs frequency]

2) Percentage of passwords having at least one special character vs passwords having no special character:

[Chart: passwords that have a special character vs those that don’t]

3) Percentage of passwords that have at least one number, one letter and one special character AND a length of 8 or more.

Answer: 1.4856%

Let’s see a comparison of passwords of length 8 or more (69.302%) vs passwords of length 8 or more that combine letters, numbers and special characters (1.485%):

[Chart: passwords combining alphabets, numbers and special characters]

That’s about it for now – it was fun!

 

And for those interested, here are a few behind-the-scenes technical details:

Tools I used:

1. Excel & 2. SQL Server

Note: I first tried using Google Refine to augment the data, but it crashed on me, so I thought of using SQL Server and T-SQL. If Excel 2010 supported 2+ million rows, I would not have needed SQL Server. Anyhow, the tool used is not important here.

Initial state:

2 million passwords in a .txt file.

Information I appended to the data-set using TSQL:

1. Length of password

2. Has Alphabets?

[a-zA-Z]

3. Has Numbers?

[0-9]

4. Has special Characters?

[^a-zA-Z0-9]

Plus a few others derived from #2, #3 & #4, like “has alphabets + numbers + special characters?”
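The derived columns above can be sketched in a few lines of Python. This is not the T-SQL I ran, just an equivalent illustration using the same three character classes. The function names, the "strong" flag name and the sample passwords are made up for this example.

```python
import re


def password_features(password):
    """Derive the length and the three character-class flags for one password."""
    has_alpha = re.search(r"[a-zA-Z]", password) is not None
    has_digit = re.search(r"[0-9]", password) is not None
    has_special = re.search(r"[^a-zA-Z0-9]", password) is not None
    return {
        "length": len(password),
        "has_alpha": has_alpha,
        "has_digit": has_digit,
        "has_special": has_special,
        # Derived flag: all three classes present AND length of 8 or more.
        "strong": has_alpha and has_digit and has_special and len(password) >= 8,
    }


def percent_strong(passwords):
    """Percentage of passwords for which the derived 'strong' flag is true."""
    strong = sum(1 for p in passwords if password_features(p)["strong"])
    return 100.0 * strong / len(passwords)


# Hypothetical sample; the real data-set had 2 million+ rows.
sample = ["password", "abc123", "P@ssw0rd1", "12345678"]
print(percent_strong(sample))
```

For a real 2 million row file you would read the passwords line by line from the .txt file instead of keeping a hard-coded list.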

That’s about it for the technical details. Ping me if interested!

 

Data Mining: Classification VS Clustering (cluster analysis)


For someone who is new to data mining, classification and clustering can seem similar because both kinds of data mining algorithms essentially “divide” a dataset into sub-datasets. But there is a difference between them, and in this blog-post we’ll see exactly that:

CLASSIFICATION
  • We have a training set containing data that has been previously categorized
  • Based on this training set, the algorithm finds the category that new data points belong to
  • Since a training set exists, we describe this technique as supervised learning
  • Example: We use a training dataset that categorizes customers by whether they have churned. Based on this training set, we can classify whether a customer will churn or not.

CLUSTERING
  • We do not know the characteristics of similarity of the data in advance
  • Using statistical concepts, we split the dataset into sub-datasets such that each sub-dataset contains “similar” data
  • Since a training set is not used, we describe this technique as unsupervised learning
  • Example: We use a dataset of customers and split it into sub-datasets of customers with “similar” characteristics. This information can then be used to market a product to a specific segment of customers identified by the clustering algorithm.
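To make the difference concrete, here is a toy Python sketch. It assumes one numeric feature per customer, and the numbers and labels are entirely made up. The classifier is a simple nearest-neighbour lookup and the clustering is a tiny two-group k-means. Real projects would use a proper library and many more features.

```python
def classify_nearest(training, point):
    """Supervised: return the label of the closest labeled training value."""
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - point))
    return nearest_label


def cluster_two_groups(values, iterations=10):
    """Unsupervised: split values into two groups with a tiny 1-D k-means."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for value in values:
            # Assign each value to the closer of the two centers.
            index = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[index].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return groups


# Hypothetical training set: support calls per month, with a known churn label.
training = [(1, "stays"), (2, "stays"), (9, "churns"), (11, "churns")]
label = classify_nearest(training, 10)

# The same kind of values without labels; clustering finds the groups itself.
groups = cluster_two_groups([1, 2, 3, 9, 10, 11])

print(label)
print(groups)
```

Notice that the classifier needs the labels up front, while the clustering function never sees a label and only groups values that are close to each other.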

If you want to learn more about data mining, check out the free book in PDF format: “Mining of Massive Datasets”.

Addressing a few questions a reader had about Google’s BigData offering, BigQuery


After reading First Impression: Google’s BigData offering called BigQuery, a reader (Shadab Shah) had a few questions about it. In this blog-post, I am going to address those questions:

Q1) Are there any browser-based tools to query data in BigQuery?

A1: They have a browser-based tool, which they call the “BigQuery Browser Tool”, that you can use to query data.

Apart from the browser-based tool, there are other tools too:

1) A command-line tool called the “BQ command-line tool”. You can find more information here: https://developers.google.com/bigquery/

2) An API. One can “include” big data analytics capabilities in a web app via a RESTful API. (Point #2 content credit: Michael Manoochehri’s comment)

Q2) Where is the Data Stored? 

If I just say “Google Cloud”, that would not be a complete answer. There’s a complementary service called “Google Cloud SQL”, and I do not want you to confuse the data stored for BigQuery with “Google Cloud SQL”. There’s a difference between BigQuery and Google Cloud SQL; you can read about it here: https://developers.google.com/bigquery/docs/overview

Having said that, the data is stored on Google’s cloud. If you wish to use BigQuery, you’ll have to upload your data-set in CSV format; once you do so, it’s stored in Google’s cloud and is ready to be analyzed via BigQuery.

Q3) Where do I find lots of data to play with BigQuery?

Google has a few sample data-sets that you can play with:

[Screenshot: BigQuery sample data]

That’s about it for this post. Thanks Shadab Shah for the questions, I hope this post is useful.

How to import data from Azure Datamarket to Excel


Short answer: Download Azure Datamarket Excel AddIn

And you want to know why I am writing a blog post about it? Spare a couple of minutes and you will realize that you were better off just knowing the short answer. Yeah, seriously. And if you are still adamant about reading it, please drop me an email at contact[at]parasdoshi[dot]com. I want to talk to you! Seriously!

Have you ever wondered how to import data from Azure Datamarket to PowerPivot in Excel? You know what I did? Since I knew we could load data from Datamarket into PowerPivot, I did that! There is built-in support, btw:


Now, I copied this data (CTRL C) and tried pasting it into an Excel sheet (CTRL V). And you know what? Nothing happened! So I tried again! And again nothing happened. Next, I selected the data in the PowerPivot window again, this time via right click -> copy, went to the Excel worksheet and did right click -> paste special. And guess what, my laptop froze for a while, and in a weird way I was happy because I thought that the copy was successful! But again it did not work. If it had, well, I would not have written this blog post.

Anywho, it was time to read some whitepapers and blog posts, with some googling and binging. And you know what, while I was binging and googling stuff, I liked the Bing wallpaper, so I had to change my wallpaper. So I did that! Look at it, don’t you like it too:

And after a little tweeting, facebooking and searching, I found the EXCEL ADD-IN!! Yeah! You can download it here: https://datamarket.azure.com/addin

After installation, you will find it under the DATA tab. You can sign in to Datamarket directly from there. You can create a Datamarket account if you do not have one, and allow access if you have not done so before. And then you can browse the available data-sets! It’s that easy.

Then you could just select the data-set you want to import and click on “import data”:

And then click on the “import data” button that you see at the bottom of the screenshot below.

And that’s it: downloading started! Optionally, you could filter the data if you want.

That’s it. Moral of the story:

Download the Excel add-in to import data from Azure Datamarket into Excel

 

BTW I am using Excel 2010!