Hi All,
I have been working with large excel files ranging from 500,000 to 800,000 plus rows of data.
Cleaning, matching and merging data is time consuming. I am wondering if you have an good approach of dealing with large data files?
Any good data cleaning and matching tools to recommend?
How I work with the large data files currently:
1. Put the shifted columns back to the appropriate columns
2. Clean out data that's not useful
3. Match the customers from the original file to other files that's different.
4. Since there's no IDs in the original file, I look up the data using customer_name, address, email address and city to match other data files, if there's a match, the customerID will show up via vlookup.
5. At first, I started with vlooking up the email, then the address
6. After vlooking up like that, I have two different columns with different IDS.
7. I wanted to merge them together on as one column. Currently, I use a filter on ID using Address to get rid of the blanks and N/A.
And for the ID using email column, I use = to the values of the " ID using address" column
For example:
[TABLE="class: grid, width: 500"]
<tbody>[TR]
[TD]CustomerName[/TD]
[TD]ID using email[/TD]
[TD]ID using address[/TD]
[/TR]
[TR]
[TD]ABC company[/TD]
[TD]123[/TD]
[TD]N/A[/TD]
[/TR]
[TR]
[TD]HairSalon556[/TD]
[TD]N/A[/TD]
[TD]4351[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]N/A[/TD]
[TD]235325[/TD]
[/TR]
[TR]
[TD]Makeupforever[/TD]
[TD]789[/TD]
[TD]789[/TD]
[/TR]
[TR]
[TD]HairExtension[/TD]
[TD]5437[/TD]
[TD]N/A[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]N/A[/TD]
[TD]345567[/TD]
[/TR]
</tbody>[/TABLE]
[TABLE="class: grid, width: 500"]
<tbody>[TR]
[TD]CustomerName[/TD]
[TD]ID using email[/TD]
[/TR]
[TR]
[TD]ABC company[/TD]
[TD]123[/TD]
[/TR]
[TR]
[TD]HairSalon556[/TD]
[TD]4351[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]235325[/TD]
[/TR]
[TR]
[TD]Makeupforever[/TD]
[TD]789[/TD]
[/TR]
[TR]
[TD]HairExtension[/TD]
[TD]5437[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]34556[/TD]
[/TR]
</tbody>[/TABLE]
I think this method works well for small data files. However, for large data files, it is very time consuming.
It took me almost a week to do all the data cleaning, ID Vlookups and data merging. If would be great if you have any idea of how to do this more efficiently. Thank you so much.
I have been working with large excel files ranging from 500,000 to 800,000 plus rows of data.
Cleaning, matching and merging data is time consuming. I am wondering if you have an good approach of dealing with large data files?
Any good data cleaning and matching tools to recommend?
How I work with the large data files currently:
1. Put the shifted columns back to the appropriate columns
2. Clean out data that's not useful
3. Match the customers from the original file to other files that's different.
4. Since there's no IDs in the original file, I look up the data using customer_name, address, email address and city to match other data files, if there's a match, the customerID will show up via vlookup.
5. At first, I started with vlooking up the email, then the address
6. After vlooking up like that, I have two different columns with different IDS.
7. I wanted to merge them together on as one column. Currently, I use a filter on ID using Address to get rid of the blanks and N/A.
And for the ID using email column, I use = to the values of the " ID using address" column
For example:
[TABLE="class: grid, width: 500"]
<tbody>[TR]
[TD]CustomerName[/TD]
[TD]ID using email[/TD]
[TD]ID using address[/TD]
[/TR]
[TR]
[TD]ABC company[/TD]
[TD]123[/TD]
[TD]N/A[/TD]
[/TR]
[TR]
[TD]HairSalon556[/TD]
[TD]N/A[/TD]
[TD]4351[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]N/A[/TD]
[TD]235325[/TD]
[/TR]
[TR]
[TD]Makeupforever[/TD]
[TD]789[/TD]
[TD]789[/TD]
[/TR]
[TR]
[TD]HairExtension[/TD]
[TD]5437[/TD]
[TD]N/A[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]N/A[/TD]
[TD]345567[/TD]
[/TR]
</tbody>[/TABLE]
[TABLE="class: grid, width: 500"]
<tbody>[TR]
[TD]CustomerName[/TD]
[TD]ID using email[/TD]
[/TR]
[TR]
[TD]ABC company[/TD]
[TD]123[/TD]
[/TR]
[TR]
[TD]HairSalon556[/TD]
[TD]4351[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]235325[/TD]
[/TR]
[TR]
[TD]Makeupforever[/TD]
[TD]789[/TD]
[/TR]
[TR]
[TD]HairExtension[/TD]
[TD]5437[/TD]
[/TR]
[TR]
[TD]BeautySupply[/TD]
[TD]34556[/TD]
[/TR]
</tbody>[/TABLE]
I think this method works well for small data files. However, for large data files, it is very time consuming.
It took me almost a week to do all the data cleaning, ID Vlookups and data merging. If would be great if you have any idea of how to do this more efficiently. Thank you so much.