Excel might be thought of as old hat for a lot of data professionals, but every data professional worth their salt should know how to navigate excel. I've even heard someone say "If you can't do it in excel, your business probably doesn't need it." With all of the fancy big data tools out there, that's some claim. But, there's a hint of truth to it, many are not using excel to it's fullest capabilities and I've heard many data professionals lament that they've put so much time and effort into different dashboards and reporting tools only to be told that people would much rather view that data in excel. So, this blog post will explain how to use two very important and powerful tools in excel: Macros in VBA and Python. The addition of python is quite recent, but much anticipated and I'll only be scratching the surface here. First, we'll go over a simple VBA script to create a macro to clean up some data and then we'll do the same with a python script.
Part 1: Macros in VBA for CSV Cleanup
Understanding Macros and VBA
Before diving into the specifics of writing a macro for cleaning up CSV files, let's demystify some terms. A macro is a sequence of instructions that automates tasks. Macros can significantly reduce manual efforts in repetitive tasks, which will save a lot of time.
VBA, or Visual Basic for Applications, is the programming language used to write macros within Microsoft Office applications. It allows users to create custom functions, automate tasks, and manipulate data within Excel, which is particularly useful for handling CSV files.
Writing Your First VBA Macro
To write a macro in VBA that cleans a CSV file, you'll need to follow these steps within Excel:
Sub CleanCSV() Dim ws As Worksheet Dim rng As Range Dim cell As Range ' Assuming data is in the first worksheet Set ws = ThisWorkbook.Sheets(1) ' Assuming data starts from A1 and spans the used range Set rng = ws.UsedRange ' Loop through each cell in the range For Each cell In rng ' Trim leading and trailing spaces cell.Value = Trim(cell.Value) ' Convert to lowercase cell.Value = LCase(cell.Value) ' Remove non-alphanumeric characters using Regular Expressions With CreateObject("VBScript.RegExp") .Global = True .Pattern = "[^\w\s]" ' Pattern to keep letters, numbers, and spaces cell.Value = .Replace(cell.Value, "") End With Next cellEnd Sub
This code defines a subroutine named CleanCSV. It uses a loop to go through each cell, trims spaces, converts text to lowercase, and employs regular expressions to remove non-alphanumeric characters.
Don't forget: when you save the excel sheet with the macro you'll need to save it as a .xlsm file instead of an .xlsx file.
Recommended by LinkedIn
Part 2: CSV Cleanup with Python and openpyxl
The Rise of Python in Excel
With the recent integration of Python into Excel, automating tasks and manipulating data has become more versatile and powerful. Python, known for its simplicity and robust libraries, offers extensive capabilities for data analysis and manipulation, making it an excellent tool for cleaning CSV files directly within Excel.
Writing a Python Script for CSV Cleanup
To clean a CSV file using Python in Excel, you'll write a script that reads the data, processes it, and writes back the cleaned data. Here's how you can accomplish this:
import reimport openpyxldef clean_data(cell_value): # Trim leading and trailing spaces, commas, and periods cleaned = cell_value.strip(" ,.") # Convert to lowercase cleaned = cleaned.lower() # Remove non-alphanumeric characters except spaces cleaned = re.sub(r'[^\w\s]', '', cleaned) return cleaneddef clean_csv(file_name): # Load the workbook and select the active worksheet wb = openpyxl.load_workbook(file_name) ws = wb.active # Assuming you want to clean the active sheet # Iterate through each cell in the worksheet for row in ws.iter_rows(): for cell in row: if cell.value: # Check if the cell contains data # Clean the cell data and update the cell cell.value = clean_data(str(cell.value)) # Save the cleaned data into a new file or overwrite the existing one cleaned_file_name = "cleaned_" + file_name wb.save(cleaned_file_name) print(f"Cleaned data saved to {cleaned_file_name}")# Specify the Excel file nameexcel_file_name = "BlogMacroExcel.xlsm"clean_csv(excel_file_name)
This script defines two functions: clean_data, which cleans individual cell values, and clean_csv, which opens the specified Excel file, iterates over each cell in the active sheet, cleans the cell values using clean_data, and then saves the cleaned data into a new file prefixed with "cleaned_". You can adjust the clean_csv function to target a specific sheet by name if needed, using wb['SheetName'] instead of wb.active.
Conclusion
Choosing between VBA macros and Python scripts for cleaning data in Excel depends on your specific needs, skill set, and the complexity of the task at hand. VBA, being native to Excel, offers seamless integration and is well-suited for straightforward tasks within Excel itself. Its primary advantage lies in its accessibility and that fact that the user will not have to deal with the nightmare of managing their python environment. However, VBA can be limiting for more complex data manipulation tasks and is generally considered less powerful than Python.
On the other hand, Python, with its vast array of libraries provides a more robust solution for complex data cleaning tasks. The ability to leverage libraries like Pandas makes Python particularly appealing for tasks involving large datasets or requiring sophisticated data transformation. The main drawback is the need for an external setup or integration into Excel, which might be a barrier for those less familiar with programming environments outside of Excel.
VBA might be the go-to for quick, Excel-specific tasks, while Python could be the better choice for more complex data manipulations or when working within a broader data analysis workflow.
Github: https://github.com/direnz
Linkedin: https://www.linkedin.com/in/charlesdirenzo/