Data Preprocessing/Cleaning with TSSFL Stack, Pandas, and Gspread
Posted: Tue May 04, 2021 6:55 am
This topic is an extension of the topics Automating Spreadsheets and Automating Multiple Excel Sheets with TSSFL ODF. It briefly showcases how various data tasks, such as preprocessing/cleaning can be carried out. Data cleaning/preprocessing is an important screening stage before carrying out data analysis. The goal of data preprocessing is to discard irrelevant and redundant or noisy information and unreliable data which can produce misleading results.
The Python snippet below performs the following tasks:
See more functionalities at Examples of gspread Usage
The Python snippet below performs the following tasks:
- Read data from the spreadsheet by using spreadsheet ID, create a worksheet, and then create a Pandas DataFrame (df1) from the worksheet.
- Preprocesses data by creating the second Pandas DataFrame (df2) for which all strings named HIV/AIDS are removed from the column named "Ailment cured", similarly, it renames all strings named "Gonorrhoea, syphilis" to "Gonorrhoea & Syphilis". Note that Python is case-sensitive.
- It creates a new spreadsheet and names it "A New Test Spreadsheet". It also creates a worksheet (worksheet2) within the newly created spreadsheet.
- It updates worksheet2 by copying the Pandas DataFrame df2 into it.
- Finally, it sends an updated spreadsheet to an email.
- import gspread
- import urllib.request
- urllib.request.urlretrieve("https://www.dropbox.com/s/m728v370159b2xm/credentials.json?dl=1", "credentials.json")
- gc = gspread.service_account(filename="credentials.json")
- sh = gc.open_by_key("1pm1mGdRgpitrYQiGqUNSHPdR43e-ZSXCavYr-TcqtwU") #Open spreadsheet by ID
- worksheet = sh.sheet1
- import pandas as pd
- df1 = pd.DataFrame(worksheet.get_all_records())
- #Preprocess Data
- df2 = df1[~df1['Ailment cured'].isin(['HIV/AIDS'])]
- #df['Ailment cured'] = df['Ailment cured'].replace({'Gonorrhoea, syphilis':'Gonorrhoea & Syphilis'})
- df2["Ailment cured"] = df2['Ailment cured'].replace('Gonorrhoea, syphilis', 'Gonorrhoea & Syphilis')
- print(df2)
- #Let's create a new blank spreadsheet:
- sh2 = gc.create('A New Test Spreadsheet')
- worksheet2 = sh2.sheet1
- #Let's write df2 to a new worksheet
- worksheet2.update([df2.columns.values.tolist()] + df2.values.tolist())
- #Share the new worksheet to an email:
- sh2.share('ey@tssfl.co', perm_type='user', role='writer')
- #Finally,
- import os
- os.remove("credentials.json")
- #We can also combine/concatenate sheets
See more functionalities at Examples of gspread Usage