The purpose of this blog post is to show the strength of Python Pandas library in combination with EVL Tool Anonymization microservice. EVLTool is a powerful family of services that provides data transformations in several business areas, across many platforms, with huge data volumes.
Python is very efficient at processing tabular data like csv and excel files, google spreadsheets, and SQL tables. Especially Python’s PANDAS library enables quick creation of various table operations suitable even for novices.
EVLTool Anonymization example
EVL Tool Anonymization microservice processes anonymization of input data by applying functions declared in an excel configuration file. It is an especially useful example as the results of anonymization are easy to see. We are going to test some anonymization functions and show the results in several steps:
- Read input data csv file and configuration excel file into dataframes
- Change one anonymization function and save the new configuration csv file
- Call the linux EVL job for running the anonymization
- Show the comparison of original and anonymized data for one chosen entity on the screen
- Create a new excel file with two sheets: one with the current configuration and another one containing samples of original and anonymized data
- Open the excel and let the user view the results
Pandas DataFrame
Python Pandas library uses the DataFrame for storing the internal representation of the input, output and configuration data. It offers several functions for reading/writing various data file types like pd.read_excel and pd.read_csv and plenty of functions for manipulation of the rows and columns in the dataframes. We will be using mainly the positioning functions. Additionally we are going to use also two Python libraries: subprocess, and os for spawning operating system programs.
Code Examples
The whole program and video are available: see details in the last chapter. Here we will briefly describe some Pandas features using a few examples of the code.
Importing Libraries
These libraries will be needed for performing some operations
import pandas as pd # import Python Pandas Data Analysis library
import subprocess # import Subprocess module for spawning processes
import os # import OS module for operating system interfaces
Input and output files definition
Declaration of directories and filenames, relative to the program location, in this case, it's in the main anonymization project folder.
config_excel_file = 'configs/sample_source.xlsx'
config_csv_file = 'configs/sample_source.csv'
source_file = 'data/source/addresses.csv'
anon_file = 'data/anon/addresses.csv'
output_excel = 'py_anon_blog_data.xlsx'
Reading files into the DataFrames
pd.read creates a dataframe from the chosen file (type). Parameters can define plenty of options like delimiters, column lists, sheet names, etc.
config = pd.read_excel(config_excel_file, sheet_name='anon', encoding="utf-8")
source = pd.read_csv(source_file, sep=';', usecols = ['addr_id', 'addr_type_code
anon = pd.read_csv(anon_file, sep=';', usecols = ['addr_id', 'addr_type_code'
Function: changing anonymization types
source.loc[3].at[‘city’] identifies the value of the anon_type column on 4th row (counting from 0) and if it equals RANDOM it changes the value to MASK_LEFT(4). These two values can be changed to any of the available EVLTool anonymization functions. Then the config DataFrame is exported to sample_source.csv file.
def change_config():
if config.loc[3].at['anon_type'] == 'RANDOM':
config.at[3,'anon_type'] = 'MASK_LEFT(4)'
config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
config.to_csv(config_csv_file, sep = ';', index = False)
print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')
Function: running EVL job
Construct the EVL Data Anonymization job command and call it via bash
def change_config():
if config.loc[3].at['anon_type'] == 'RANDOM':
config.at[3,'anon_type'] = 'MASK_LEFT(4)'
config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
config.to_csv(config_csv_file, sep = ';', index = False)
print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')
Function: comparing original and anonymized data
Comparing original (source) city and anonymized (anon) city for all rows. source.loc[i].at[‘city’] contains the value of the cell in the row i and column name ‘city’
def compare_cities():
i = 0
print('Compare city before and after anonymization')
for i in range(len(source)):
print('\t' + source.loc[i].at['city']+'\t\t\t' + anon.loc[i].at['city'])
Function: save DataFrames to the appropriate files
def save_files():
print('Saving files')
data = source.append(anon) # Append anonymized data below the source data
excel = pd.ExcelWriter(output_excel, engine='xlsxwriter')
config.to_excel(excel,'anon', index = False)
# index=True would create additional 1st column containing row numbers
data.to_excel(excel,'data', index = False)
excel.close()
Calling the functions
# Calling functions
change_config() # Function for changing the anonymization function for
run_evl() # Function runs EVL Data Anonymization job defined in the config file for
# the file addresses.csv
compare_cities() # Function for comparing values of the entity city before and after anonymization
save_files() # Function saves dataframes into excel on 2 sheets
Open the excel file
print('Starting excel. Check the changes in both sheets')
start_excel = 'start excel.exe ' + output_excel
os.system(start_excel) # Opens excel file containing new config and data
Whole program code
# This program shows some features mainly the Python Pandas module for working with DataFrames,
# excel and csv files using the EVL Data Anonymization example
# Steps:
# reads the excel configuration file and input csv file
# runs the EVL Data Anonymization program for anonymizing input file according to the configuration
# in the configuration file
# creates a new excel with the configuration on 1st sheet and merged original
# and anonymized data on the second sheet
# opens excel for checking the result
# Import necessary Python modules
import pandas as pd # import Python Pandas Data Analysis library
import subprocess # import Subprocess module for spawning processes
import os # import OS module for oprating system interfaces
# Change of anonymization for entity = city (4th column, numbering starting with 0) if the current anonymization parameter is RANDOM
def change_config():
if config.loc[3].at['anon_type'] == 'RANDOM':
config.at[3,'anon_type'] = 'MASK_LEFT(4)'
config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
config.to_csv(config_csv_file, sep = ';', index = False)
print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')
def compare_cities():
i = 0
print('Compare city before and after anonymization')
for i in range(len(source)):
print('\t' + source.loc[i].at['city']+'\t\t\t' + anon.loc[i].at['city'])
# Functions for running EVL Data Anonymization program
def run_evl():
evl = "bash -c '. $HOME/.evlrc; evl run/anon/sample_source." # starting bash, setting environment
job = 'addresses' # set the anonymized entity and the approproate evl job
suffix = ".evl 1>py_blog.log 2>&1'" # add the job suffix .evl and move errors and output to the logfile
evljob = evl + job + suffix
print('Running EVL job: ',evljob)
subprocess.call(evljob, shell=True)
# Function for saving the dataframes into excel file
def save_files():
print('Saving files')
data = source.append(anon) # Append anonymized data below the source data
excel = pd.ExcelWriter(output_excel, engine='xlsxwriter')
config.to_excel(excel,'anon', index = False) # index=True would create additional 1st column containing row numbers
data.to_excel(excel,'data', index = False)
excel.close()
# MAIN
# Define input and output files
config_excel_file = 'configs/sample_source.xlsx'
config_csv_file = 'configs/sample_source.csv'
source_file = 'data/source/addresses.csv'
anon_file = 'data/anon/addresses.csv'
output_excel = 'blog_data.xlsx'
# Read the files into Dataframes
print('Reading files into dataframes')
config = pd.read_excel(config_excel_file, sheet_name='anon', encoding="utf-8")
source = pd.read_csv(source_file, sep=';', usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])
anon = pd.read_csv(anon_file, sep=';', usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])
# Calling functions
change_config() # Function for changing the anonymization function for
run_evl() # Function runs EVL Data Anonymization job defined in the config file for
# the file addresses.csv
compare_cities() # Function for comparing values of the entity city before and after anonymization
save_files() # Function saves dataframes into excel on 2 sheets
print('Starting excel. Check the changes in both sheets')
os.system('start excel.exe blog_data.xlsx') # Opens excel
Resources
The source code of the Python program anon.py
Video contains
- Showing the contents of the original config file and data source
- running the Python program
- opening an excel file with changed data and config file
For running the code you need to download the Anonymization trial from the EVL Tool Web and store the Python program into the sample_source folder
Environment
- Windows 10
- Python 3.8.3
- WSL (Windows Subsystem for Linux)
- Ubuntu 18.0.4
- EVLTool Anonymization Trial