Data Preprocessing/Cleaning with TSSFL Stack, Pandas, and Gspread

Including Cython, Jython, IronPython, PyPy, Django framework, and interpreters: Ipython, IPython Jupyter/Notebook, CPython


Post Reply
User avatar
Eli
Senior Expert Member
Reactions: 183
Posts: 5334
Joined: 9 years ago
Location: Tanzania
Has thanked: 75 times
Been thanked: 88 times
Contact:

#1

This topic is an extension of the topics Automating Spreadsheets and Automating Multiple Excel Sheets with TSSFL ODF. It briefly showcases how various data tasks, such as preprocessing/cleaning can be carried out. Data cleaning/preprocessing is an important screening stage before carrying out data analysis. The goal of data preprocessing is to discard irrelevant and redundant or noisy information and unreliable data which can produce misleading results.

The Python snippet below performs the following tasks:
  1. Read data from the spreadsheet by using spreadsheet ID, create a worksheet, and then create a Pandas DataFrame (df1) from the worksheet.
  2. Preprocesses data by creating the second Pandas DataFrame (df2) for which all strings named HIV/AIDS are removed from the column named "Ailment cured", similarly, it renames all strings named "Gonorrhoea, syphilis" to "Gonorrhoea & Syphilis". Note that Python is case-sensitive.
  3. It creates a new spreadsheet and names it "A New Test Spreadsheet". It also creates a worksheet (worksheet2) within the newly created spreadsheet.
  4. It updates worksheet2 by copying the Pandas DataFrame df2 into it.
  5. Finally, it sends an updated spreadsheet to an email.
  1. import gspread
  2. import urllib.request
  3. urllib.request.urlretrieve("https://www.dropbox.com/s/m728v370159b2xm/credentials.json?dl=1", "credentials.json")
  4.  
  5. gc = gspread.service_account(filename="credentials.json")
  6. sh = gc.open_by_key("1pm1mGdRgpitrYQiGqUNSHPdR43e-ZSXCavYr-TcqtwU") #Open spreadsheet by ID
  7. worksheet = sh.sheet1
  8.  
  9. import pandas as pd
  10.  
  11. df1 = pd.DataFrame(worksheet.get_all_records())
  12. #Preprocess Data
  13. df2 = df1[~df1['Ailment cured'].isin(['HIV/AIDS'])]
  14. #df['Ailment cured'] = df['Ailment cured'].replace({'Gonorrhoea, syphilis':'Gonorrhoea & Syphilis'})
  15. df2["Ailment cured"] = df2['Ailment cured'].replace('Gonorrhoea, syphilis', 'Gonorrhoea & Syphilis')
  16. print(df2)
  17.  
  18. #Let's create a new blank spreadsheet:
  19.  
  20. sh2 = gc.create('A New Test Spreadsheet')
  21. worksheet2 = sh2.sheet1
  22.  
  23. #Let's write df2 to a new worksheet
  24. worksheet2.update([df2.columns.values.tolist()] + df2.values.tolist())
  25. #Share the new worksheet to an email:
  26. sh2.share('ey@tssfl.co', perm_type='user', role='writer')
  27.  
  28. #Finally,
  29. import os
  30. os.remove("credentials.json")
  31. #We can also combine/concatenate sheets


See more functionalities at Examples of gspread Usage
0
TSSFL -- A Creative Journey Towards Infinite Possibilities!
Post Reply

Return to “Python Programming”

  • Information
  • Who is online

    Users browsing this forum: No registered users and 10 guests