First Public Service Webinar: Accelerated Research Solutions with Digital Technologies

#1

Organizer: Dar Es Salaam University College of Education (www.duce.ac.tz)
Presenter: Dr Yohana Elimboto
Moderator: Dr Sixbert Mlowe
Date: 30th June 2021
Time: 2:00 PM - 2:40 PM
Platform: Zoom

Link:

https://us04web.zoom.us/j/76187059507?p ... rVVdFUWNKW
HUrdz09
Meeting ID: 761 8705 9507
Passcode: 4NhpNX

If you do not have Zoom on your device, you can join the meeting viaTSSFL ODF: integrated-zoom-video-conferencing-for-meetings-6420

Abstract: TSSFL Open Discussion Forums (TSSFL ODF) is an open-source knowledge base set of interactive forums that facilitate communications and collaborations in research to solve a wide range of real-life problems. TSSFL ODF is integrated with numerous
research tools, making it a very robust cloud platform that can be instantly deployed to provide inspiring and unprecedented cost-effective solutions in research. During this talk, Dr. Yohana will showcase how to use the TSSFL ODF integrated and the associated data science tools and data repositories, specifically Python programming language, Open Data Kit (ODK) Collect, Google drive, Google spreadsheets, Excel worksheets, DropBox, and GitHub to automate research processes from data collection to data storage, data management, data retrieval, data cleaning, data processing, data analysis, and finally, research results and reports.

Please find more information from the attached document.

#2

Live Webinar: Accelerated Research Solutions with Digital Technologies

Name: Elimboto Yohana
Institution: Dar Es Salaam University College of Education

Introduction

TSSFL ODF is a currently a cloud computing platform apart from being an interactive forum.
It is interoperable with a number of robust cloud platforms, e.g., Google.
TSSFL ODF eliminates time, spatial and technological constraints in research, teaching/learning and problem-solving, that's,
Use any internet connected device from anywhere at anytime.
Today we are focusing on data collection and research to see how it works!

#3

Let's start by running few codes:

First code:

Code: [Select all] [Expand/Collapse]

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
 
fig = plt.figure(figsize=(10,10)) #Define figure size
ax = fig.add_subplot(111, projection='3d')
plt.rcParams['legend.fontsize'] = 20
 
np.random.seed(4294967294) #Used the random seed for consistency
 
mu_vec_1 = np.array([0,0,0])
cov_mat_1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class_1_sample = np.random.multivariate_normal(mu_vec_1, cov_mat_1, 30).T
assert class_1_sample.shape == (3,30), "The matrix dimensions is not 3x30"
 
mu_vec_2 = np.array([1,1,1])
cov_mat_2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class_2_sample = np.random.multivariate_normal(mu_vec_2, cov_mat_2, 30).T
assert class_2_sample.shape == (3,30), "The matrix dimensions is not 3x30"
 
ax.plot(class_1_sample[0,:], class_1_sample[1,:], class_1_sample[2,:], 'o', markersize=10, color='green', alpha=1.0, label='Class 1')
ax.plot(class_2_sample[0,:], class_2_sample[1,:], class_2_sample[2,:], 'o', markersize=10, alpha=1.0, color='red', label='Class 2')
 
plt.title('Data Samples for Classes 1 & 2', y=1.04)
ax.legend(loc='upper right')
plt.savefig('Multivariate_distr.png', bbox_inches='tight')
plt.show()

Second code:

Code: [Select all] [Expand/Collapse]

import matplotlib.pyplot as plt
import seaborn as sns
# Load Dataset
df = sns.load_dataset('iris')
# Plot
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df, kind="reg", hue="species")
plt.show()

Third code:

Code: [Select all] [Expand/Collapse]

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
sns.set()
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', height=2.5)
plt.show()

Fourth code:

Code: [Select all] [Expand/Collapse]

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Import Dataset
 
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
 
# Plot
plt.figure(figsize=(12,10), dpi= 80)
sns.heatmap(df.corr(), xticklabels=df.corr().columns, yticklabels=df.corr().columns, cmap='RdYlGn', center=0, annot=True)
 
# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

#4

Data Collection

Data can be Collected using:

ODK Collect;
Google survey forms;
Any other means.

Data Collection with ODK Collect

Further TSSFL ODF - ODK Collect Integration for Data collection:

Data Collection with Google Survey forms:

Data Generated by Hand

#5

Data Retrieval, Processing, Visualization, and Statistics

Let's manipulate the data with the following codes:

ODK Test Spreadsheet data:

Code: [Select all] [Expand/Collapse]

#Plot some graph
#Import required libraries
import gspread
import urllib.request
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
 
"""
urllib.request.urlretrieve("https://www.dropbox.com/s/mqsyfuetv8potvd/credentials.json?dl=1", "credentials.json")
 
gc = gspread.service_account(filename="credentials.json")
sh = gc.open_by_key("1BC48PKPZW71AC6hOn1SNesvZk1PoBjE4wXwl1YlWFpY") #Open spreadsheet,
#the spreadsheet ID starts with 1019ke.... between "" in the line above
"""
#Alternative
#If your file only has one sheet, replace sheet_url
sheet_url = "https://docs.google.com/spreadsheets/d/1BC48PKPZW71AC6hOn1SNesvZk1PoBjE4wXwl1YlWFpY/edit#gid=0"
url_1 = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
 
data = pd.read_csv(url_1)
 
#worksheet = sh.sheet1
 
#Define variables
var1 = "Name"
var2 = "Age"
var3 = "Height"
 
#age = worksheet.col_values(3)[1:]
age = data[var2]
print("Ages:", age)
#height = worksheet.col_values(4)[1:]
height = data[var3]
print("Heights:", height)
 
#Pandas is extremely very useful for Google spreadsheets
#Convert the json to Pandas dataframe
#Get all data records as dictionary
#data = worksheet.get_all_records()
#df = pd.DataFrame.from_dict(data)
 
#Let's get some statistics
#age_arr = np.array(age)
#age_array = age_arr.astype(float)
#h_arr = np.array(height)
#h_array = h_arr.astype(float)
 
print("Average Age:", np.mean(age))
print("Mean Height:", np.mean(height))
print("Minimum and Maximum Age:", np.min(age), np.max(age))
print("Minimum and Maximum Height:", np.min(height), np.max(height))
 
#Let's visualize
#Graph styles and font size
sns.set_style('darkgrid') # darkgrid, white grid, dark, white and ticks
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes
 
#sns list of color plettes
#print(sns.color_palette('deep'), sns.color_palette("pastel"), sns.color_palette("Set2"))
 
#Let's Read Data from Google Sheets into Pandas without the Google Sheets API
#Useful for multiple sheets
sheet_id = "1BC48PKPZW71AC6hOn1SNesvZk1PoBjE4wXwl1YlWFpY"
sheet_name = "Sheet1"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
 
#If your file only has one sheet, replace sheet_url
#sheet_url = “https://docs.google.com/spreadsheets/d/1XqOtPkiE_Q0dfGSoyxrH730RkwrTczcRbDeJJpqRByQ/edit#gid=0"
#url_1 = sheet_url.replace(‘/edit#gid=’, ‘/export?format=csv&gid=’)
 
#Get Pandas dataframe
dataset = pd.read_csv(url)
#print(dataset)
 
#Names = worksheet.col_values(2)[1:]
#Names = data[var1]
#print(Names)
 
df_names = dataset[var1]
df_ages = dataset[var2]
df_heights = dataset[var3]
print(df_names)
 
#Preprocessing
plots = dataset.groupby([var1], as_index=False).mean()
#print(plots)
 
#Bar Plot in MatplotLib with plt.bar()
#Names vs Age
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('pastel')
plt.bar(dataset[var1], dataset[var2], color=colors[:5])
plt.xlabel(var1)
plt.xticks(rotation=90)
plt.ylabel('Age')
plt.title('Barplot')
plt.show()
 
#Name Vs Height
plt.figure()
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('deep')
plt.bar(dataset[var1], dataset[var3], color=colors[:6])
plt.xlabel(var1)
plt.xticks(rotation=90)
plt.ylabel('Height')
plt.title('Barplot')
plt.show()
 
#Bar Plot in Seaborn with sns.barplot()
plt.figure(figsize=(10,5), tight_layout=True)
ax = sns.barplot(x=dataset[var1], y=dataset[var2], palette='pastel', ci=None)
ax.set(title='Barplot with Seaborn', xlabel='Names', ylabel='Age')
plt.xticks(rotation=90)
plt.show()
 
#Barplot grouped data by "n" variables
plt.figure(figsize=(12, 6), tight_layout=True)
ax = sns.barplot(x=dataset[var2], y=dataset[var3], hue=dataset[var1], palette='pastel')
ax.set(title='Age vs Height' ,xlabel='Age', ylabel='Height')
ax.legend(title='Names', title_fontsize='13', loc='upper right')
plt.show()
 
#Histograms with plt.hist() or sns.histplot()
plt.figure(figsize=(10,6), tight_layout=True)
bins = [160, 165, 170, 175, 180, 185, 190, 195, 200]
# matplotlib
plt.hist(dataset[var3], bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
plt.title('Histogram')
plt.xlabel('Height (cm)')
plt.ylabel('Count')
# seaborn
ax = sns.histplot(data=dataset, x=var3, bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
ax.set(title='Histogram', xlabel='Height (cm)', ylabel='Count')
plt.show()
 
#Boxplot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.boxplot(data=dataset, x=var1, y=var2, palette='Set2', linewidth=2.5)
ax.set(title='Boxplot', xlabel='Names', ylabel='Age (Years)')
plt.xticks(rotation=90)
plt.show()
 
#Scatter plot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.scatterplot(data=dataset, x=var2, y=var3,   hue=var1, palette='Set2', s=60)
ax.set(xlabel='Age (Years)', ylabel='Height (cm)')
ax.legend(title='People', title_fontsize = 12)
plt.show()
 
#Something else
pivot = dataset.groupby([var1], as_index=False).mean()
relationship = pivot.loc[:,var2:var3]
print(relationship)
 
#Plot some graph
charts = ["bar", "line", "barh", "hist", "box", "kde", "density", "area"]
for chart_type in charts:
    relationship.plot(kind="%s" % chart_type) #Replace bar with line, barh, hist, box, kde, density, area
    plt.title("%s plot" % chart_type)
    plt.show()
 
#Seaborn
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = var2, y = var3)
plt.show()
 
plt.figure()
sns.set_style("whitegrid")
sns.lineplot(data = dataset, x = var2, y = var3)
plt.show()
 
#Hexbin
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 20
plt.title('Hexbin')
plt.hexbin(dataset[var2], dataset[var3], gridsize=nbins, color=colors[:6])
plt.show()
 
#2-D Hist
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset[var2], dataset[var3], bins=nbins, color=colors[:5])
plt.show()
 
#Set variables
x = dataset[var2]
y = dataset[var3]
z = dataset[var1]
 
#Linear Regression
plt.figure()
sns.regplot(x = x, y = y, data=dataset);
plt.show()
 
plt.figure()
sns.jointplot(x=x, y=y, data=dataset, kind="reg");
plt.show()
 
#Set seaborn style
sns.set_style("white")
 
# Basic 2D density plot
plt.figure()
sns.kdeplot(x=x, y=y)
plt.show()
 
# Custom the color, add shade and bandwidth
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Reds", shade=True, bw_adjust=.5)
plt.show()
 
# Add thresh parameter
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Blues", shade=True, thresh=0)
plt.show()
 
#Joint plot
plt.figure()
sns.jointplot(x = x,y = y,data = dataset,kind = 'hex')
plt.show()

K43 Data

Code: [Select all] [Expand/Collapse]

#Plot some graph
#Import required libraries
import gspread
import urllib.request
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
 
#Alternative
#If your file only has one sheet, replace sheet_url
sheet_url = "https://docs.google.com/spreadsheets/d/1426rTslBl2mgggHIQnLR7WivW4xor6cp5H1Su-1SrdI/edit#gid=0"
url_1 = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
 
data = pd.read_csv(url_1)
 
#worksheet = sh.sheet1
 
#Define variables
var1 = "data-Name"
var2 = "data-Age"
var3 = "data-Weight"
 
#age = worksheet.col_values(3)[1:]
age = data[var2]
print("Ages:", age)
#weight = worksheet.col_values(4)[1:]
weight = data[var3]
print("Weights:", weight)
 
#Pandas is extremely very useful for Google spreadsheets
#Convert the json to Pandas dataframe
#Get all data records as dictionary
#data = worksheet.get_all_records()
#df = pd.DataFrame.from_dict(data)
 
#Let's get some statistics
#age_arr = np.array(age)
#age_array = age_arr.astype(float)
#h_arr = np.array(weight)
#h_array = h_arr.astype(float)
 
print("Average Age:", np.mean(age))
print("Mean Weight:", np.mean(weight))
print("Minimum and Maximum Age:", np.min(age), np.max(age))
print("Minimum and Maximum Weight:", np.min(weight), np.max(weight))
 
#Let's visualize
#Graph styles and font size
sns.set_style('darkgrid') # darkgrid, white grid, dark, white and ticks
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes
 
#sns list of color plettes
#print(sns.color_palette('deep'), sns.color_palette("pastel"), sns.color_palette("Set2"))
 
#Let's Read Data from Google Sheets into Pandas without the Google Sheets API
#Useful for multiple sheets
sheet_id = "1426rTslBl2mgggHIQnLR7WivW4xor6cp5H1Su-1SrdI"
sheet_name = "Sheet1"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
 
#If your file only has one sheet, replace sheet_url
#sheet_url = “https://docs.google.com/spreadsheets/d/1XqOtPkiE_Q0dfGSoyxrH730RkwrTczcRbDeJJpqRByQ/edit#gid=0"
#url_1 = sheet_url.replace(‘/edit#gid=’, ‘/export?format=csv&gid=’)
 
#Get Pandas dataframe
dataset = pd.read_csv(url)
#print(dataset)
 
#Names = worksheet.col_values(2)[1:]
#Names = data[var1]
#print(Names)
 
df_names = dataset[var1]
df_ages = dataset[var2]
df_weights = dataset[var3]
print(df_names)
 
#Preprocessing
plots = dataset.groupby([var1], as_index=False).mean()
#print(plots)
 
#Bar Plot in MatplotLib with plt.bar()
#Names vs Age
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('pastel')
plt.bar(dataset[var1], dataset[var2], color=colors[:5])
plt.xlabel('Name')
plt.xticks(rotation=90)
plt.ylabel('Age')
plt.title('Barplot')
plt.show()
 
#Name Vs Weight
plt.figure()
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('deep')
plt.bar(dataset[var1], dataset[var3], color=colors[:6])
plt.xlabel('Name')
plt.xticks(rotation=90)
plt.ylabel('Weight')
plt.title('Barplot')
plt.show()
 
#Bar Plot in Seaborn with sns.barplot()
plt.figure(figsize=(10,5), tight_layout=True)
ax = sns.barplot(x=dataset[var1], y=dataset[var2], palette='pastel', ci=None)
ax.set(title='Barplot with Seaborn', xlabel='Names', ylabel='Age')
plt.xticks(rotation=90)
plt.show()
 
#Barplot grouped data by "n" variables
plt.figure(figsize=(12, 6), tight_layout=True)
ax = sns.barplot(x=dataset[var2], y=dataset[var3], hue=dataset[var1], palette='pastel')
ax.set(title='Age vs Weight' ,xlabel='Age', ylabel='Weight')
ax.legend(title='Names', title_fontsize='13', loc='upper right')
plt.show()
 
#Histograms with plt.hist() or sns.histplot()
plt.figure(figsize=(10,6), tight_layout=True)
bins = [40, 50, 60, 70, 80]
# matplotlib
plt.hist(dataset[var3], bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
plt.title('Histogram')
plt.xlabel('Weight (cm)')
plt.ylabel('Count')
# seaborn
ax = sns.histplot(data=dataset, x=var3, bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
ax.set(title='Histogram', xlabel='Weight (cm)', ylabel='Count')
plt.show()
 
#Boxplot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.boxplot(data=dataset, x=var1, y=var2, palette='Set2', linewidth=2.5)
ax.set(title='Boxplot', xlabel='Names', ylabel='Age (Years)')
plt.xticks(rotation=90)
plt.show()
 
#Scatter plot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.scatterplot(data=dataset, x=var2, y=var3,   hue=var1, palette='Set2', s=60)
ax.set(xlabel='Age (Years)', ylabel='Weight (kgs)')
ax.legend(title='People', title_fontsize = 12)
plt.show()
 
#Something else
pivot = dataset.groupby([var1], as_index=False).mean()
relationship = pivot.loc[:,var2:var3]
print(relationship)
 
#Plot some graph
charts = ["bar", "line", "barh", "hist", "box", "kde", "density", "area"]
for chart_type in charts:
    relationship.plot(kind="%s" % chart_type) #Replace bar with line, barh, hist, box, kde, density, area
    plt.title("%s plot" % chart_type)
    plt.show()
 
#Seaborn
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = var2, y = var3)
plt.show()
 
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = var3, y = var2)
plt.show()
 
#replot
plt.figure()
sns.set_theme(style="darkgrid")
sns.relplot(x=var2, y=var3, hue=var1, data=data);
plt.show()
 
#Hexbin
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 15
plt.title('Hexbin')
plt.hexbin(dataset[var2], dataset[var3], gridsize=nbins, color=colors[:5])
plt.show()
 
#Hexbin 2
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 15
plt.title('Hexbin')
plt.hexbin(dataset[var2], dataset[var3], gridsize=nbins, cmap=plt.cm.BuGn_r)
plt.show()
 
#2-D Hist
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset[var2], dataset[var3], bins=nbins, color=colors[:6])
plt.show()
 
#2-D Hist 2
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset[var2], dataset[var3], bins=nbins, cmap=plt.cm.BuGn_r)
plt.show()
 
#Set variables
x = dataset[var2]
y = dataset[var3]
z = dataset[var1]
 
#Linear Regression
plt.figure()
sns.regplot(x = x, y = y, data=dataset);
plt.show()
 
plt.figure()
sns.jointplot(x=x, y=y, data=dataset, kind="reg");
plt.show()
 
#Set seaborn style
sns.set_style("white")
 
# Basic 2D density plot
plt.figure()
sns.kdeplot(x=x, y=y)
plt.show()
 
# Custom the color, add shade and bandwidth
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Reds", shade=True, bw_adjust=.5)
plt.show()
 
# Add thresh parameter
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Blues", shade=True, thresh=0)
plt.show()
 
#Joint plot
plt.figure()
sns.jointplot(x = x,y = y,data = dataset,kind = 'hex')
plt.show()

Elimboto_DSK Data - Categorical

Code: [Select all] [Expand/Collapse]

#Plot some graph
#Import required libraries
import gspread
import urllib.request
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic
#REDCap
 
#Let's visualize
#Graph styles and font size
sns.set_style('darkgrid') # darkgrid, white grid, dark, white and ticks
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes
 
sheet_id = "1pm1mGdRgpitrYQiGqUNSHPdR43e-ZSXCavYr-TcqtwU"
sheet_name = "Sheet1"
url_1 = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
 
data = pd.read_csv(url_1)
 
#print(data)
#Drop first row
#df = data.drop(labels=0, axis=0)
#df = data.drop(data.index[0])
df = data[~data['Ailment cured'].isin(['HIV/AIDS'])]
#df['Ailment cured'] = df['Ailment cured'].replace({'Gonorrhoea, syphilis':'Gonorrhoea & Syphilis'})
df["Ailment cured"] = df['Ailment cured'].replace('Gonorrhoea, syphilis', 'Gonorrhoea & Syphilis')
#print(df)
 
#Growth form vs Citation
plt.figure(figsize=(8,5))
sns.boxplot(x='Growth form',y='Citation',data=data, palette='rainbow')
plt.show()
 
#Citation vs Growth form
plt.figure(figsize=(8,5))
sns.boxplot(x='Citation',y='Growth form',data=data, palette='rainbow')
plt.show()
 
#Citation vs Growth form
plt.figure(figsize=(8,5))
sns.boxplot(x='Citation',y='Part used',data=data, palette='rainbow')
plt.tight_layout() #figure.savefig('myplot.png', bbox_inches='tight')
plt.show()
 
#Citation vs Ailment cured
plt.figure(figsize=(10,5))
sns.boxplot(x=df["Ailment cured"],y=df['Citation'],data=df, palette='rainbow')
plt.xlabel("Ailment cured", labelpad=15)
plt.tight_layout()
plt.show()
 
#Swarm plot
fig = plt.gcf()
fig.set_size_inches(30, 30)
sns.catplot(x="Citation", y="Scientific name", hue="Ailment cured", kind="swarm", data=df)
plt.tight_layout()
plt.show()
 
 
#Adding hue
#Citation vs Growth form
plt.figure(figsize=(8,5))
sns.boxplot(x='Citation',y='Growth form',data=data, hue ='Part used', palette='rainbow')
plt.tight_layout()
plt.show()
 
plt.figure(figsize=(8,5))
sns.boxplot(x='Citation',y='Growth form',data=data, hue ='Ailment cured', palette='rainbow')
plt.tight_layout()
plt.show()
 
#Violin plots
plt.figure(figsize=(8,6))
sns.violinplot(x='Citation',y='Growth form',data=data, hue ='Part used', palette='rainbow')
plt.show()
 
#Violin plots
plt.figure(figsize=(8,6))
sns.violinplot(x='Citation',y='Growth form',data=data, hue ='Ailment cured',palette='rainbow')
plt.show()
 
#Boxen plots
plt.figure(figsize=(8,6))
sns.boxenplot(x='Citation',y='Growth form',data=data, hue ='Part used', palette='rainbow')
plt.show()
 
plt.figure(figsize=(8,6))
sns.boxenplot(x='Citation',y='Part used',data=data, hue ='Ailment cured', palette='rainbow')
plt.tight_layout()
plt.show()
 
#Bar plots
plt.figure(figsize=(12,6))
sns.barplot(x='Growth form',y='Citation',data=data, palette='rainbow', hue='Part used')
plt.tight_layout()
plt.show()
 
plt.figure(figsize=(12,6))
sns.barplot(x='Ailment cured',y='Citation',data=data, palette='rainbow', hue='Part used')
plt.tight_layout()
plt.legend(loc=1)
plt.show()
 
#Point plot
plt.figure(figsize=(10,6))
sns.pointplot(x='Citation',y='Growth form',data=data)
plt.show()
 
plt.figure(figsize=(10,6))
sns.pointplot(x='Citation',y='Growth form',data=data, hue='Part used')
plt.show()
 
plt.figure(figsize=(10,6))
sns.pointplot(x='Citation',y='Growth form',data=data, hue='Part used')
plt.show()
 
plt.figure(figsize=(10,6))
sns.pointplot(x='Citation',y='Growth form',data=data, hue='Ailment cured')
plt.show()
 
#Count plot
plt.figure(figsize=(10,6))
sns.countplot(x='Growth form',data=data, palette='rainbow')
plt.show()
 
plt.figure(figsize=(10,6))
sns.countplot(x='Growth form',data=data, hue='Part used', palette='rainbow')
plt.legend(loc=1)
plt.show()
 
plt.figure(figsize=(10,6))
sns.countplot(x='Growth form',data=data, hue='Ailment cured', palette='rainbow')
plt.legend(loc=2)
plt.show()
 
 
#Strip plot - Categorical Scatter Plots
plt.figure(figsize=(12,8))
sns.stripplot(x='Citation', y='Growth form', data=data, jitter=True, hue= 'Part used', dodge=True, palette='viridis')
plt.show()
 
#Swarm plots
plt.figure(figsize=(10,6))
sns.swarmplot(x='Citation', y='Ailment cured', data=data, hue='Growth form', dodge=True, palette='viridis')
plt.tight_layout()
plt.show()
 
"""
#Combining plots
plt.figure(figsize=(12,8))
sns.violinplot(x='Citation',y="Growth form", data=data, hue='Part used', dodge='True', palette='rainbow')
sns.swarmplot(x='Citation',y="Growth form", data=data, hue='Part used', dodge='True', color='grey', alpha=.8, s=4)
plt.show()
 
#Plot 2
plt.figure(figsize=(12,8))
sns.boxplot(x='Citation',y='Part used',hue='Growth form',data=data, palette='rainbow')
sns.swarmplot(x='Citation',y='Part used',hue='Growth form', dodge=True,data=data, alpha=.8,color='grey',s=4)
 
#Plot 3
plt.figure(figsize=(12,7))
sns.barplot(x='Growth form',y='Citation',data=data, palette='rainbow', hue='Part used')
sns.stripplot(x='Growth form',y="Citation",data=data, hue='Citation', dodge='True', color='grey', alpha=.8, s=2)
plt.show()
 
#Faceting Data with Catplot
#https://towardsdatascience.com/a-complete-guide-to-plotting-categorical-variables-with-seaborn-bfe54db66bec
g = sns.catplot(x='Citation',y='Growth form', col = 'Local name', data=data,
              kind='bar', aspect=.6, palette='Set2')
(g.set_axis_labels("Class", "Survival Rate")
.set_titles("{col_name}")
.set(ylim=(0,1)))
plt.tight_layout()
plt.savefig('seaborn_catplot.png', dpi=1000)
"""
 
categorical_features = ["Growth form", "Part used", "Ailment cured", "Citation"]
fig, ax = plt.subplots(1, len(categorical_features), figsize=(16,8))
for i, categorical_feature in enumerate(data[categorical_features]):
    data[categorical_feature].value_counts().plot(kind="bar", ax=ax[i]).set_title(categorical_feature)
plt.tight_layout()
plt.show()
 
"""
#print(data)
#print(data['Local Name'])
data['Growth form'].value_counts().plot(kind='bar')
plt.show()
#data['Growth form'].value_counts().plot(kind='hist')
 
plt.figure()
from statsmodels.graphics.mosaicplot import mosaic
plt.rcParams['font.size'] = 16.0
mosaic(data, ['Growth form', 'Part used']);
plt.show()
"""
plt.figure()
sns.barplot(x=df['Growth form'].head(3),y=df['Citation'],data=df)
plt.show()

TSSFL ODF Google Survey Form responses Data:

Code: [Select all] [Expand/Collapse]

#Plot some graph
#Import required libraries
import gspread
import urllib.request
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
 
#Let's Read Data from Google Sheets into Pandas without the Google Sheets API
#Useful for multiple sheets
sheet_id = "1UnkRYcOhLFMgyT_CzByvupvdaD5cL5b_nCcOeoy1uy8"
sheet_name = "Sheet1"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
 
#Get Pandas dataframe
dataset = pd.read_csv(url)
 
#worksheet = sh.sheet1
#age = worksheet.col_values(3)[1:]
age = dataset["Age"]
print("Ages:", age)
#Height = worksheet.col_values(4)[1:]
Height = dataset["Height"]
print("Heights:", Height)
 
#Pandas is extremely very useful for Google spreadsheets
#Convert the json to Pandas dataframe
#Get all data records as dictionary
#data = worksheet.get_all_records()
#df = pd.DataFrame.from_dict(data)
 
#Let's get some statistics
#age_arr = np.array(age)
#age_array = age_arr.astype(float)
#h_arr = np.array(Height)
#h_array = h_arr.astype(float)
 
print("Average Age:", np.mean(age))
print("Mean Height:", np.mean(Height))
print("Minimum and Maximum Age:", np.min(age), np.max(age))
print("Minimum and Maximum Height:", np.min(Height), np.max(Height))
 
#Let's visualize
#Graph styles and font size
sns.set_style('darkgrid') # darkgrid, white grid, dark, white and ticks
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes
 
#sns list of color plettes
#print(sns.color_palette('deep'), sns.color_palette("pastel"), sns.color_palette("Set2"))
#If your file only has one sheet, replace sheet_url
#sheet_url = “https://docs.google.com/spreadsheets/d/1XqOtPkiE_Q0dfGSoyxrH730RkwrTczcRbDeJJpqRByQ/edit#gid=0"
#url_1 = sheet_url.replace(‘/edit#gid=’, ‘/export?format=csv&gid=’)
 
#print(dataset)
 
#Names = worksheet.col_values(2)[1:]
#Names = data["Name"]
#print(Names)
 
df_names = dataset["Name"]
df_ages = dataset["Age"]
df_Heights = dataset["Height"]
print(df_names)
 
#Preprocessing
plots = dataset.groupby(['Name'], as_index=False).mean()
#print(plots)
 
#Bar Plot in MatplotLib with plt.bar()
#Names vs Age
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('pastel')
plt.bar(dataset['Name'], dataset['Age'], color=colors[:5])
plt.xlabel('Name')
plt.xticks(rotation=90)
plt.ylabel('Age')
plt.title('Barplot')
plt.show()
 
#Name Vs Height
plt.figure()
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('deep')
plt.bar(dataset['Name'], dataset['Height'], color=colors[:6])
plt.xlabel('Name')
plt.xticks(rotation=90)
plt.ylabel('Height')
plt.title('Barplot')
plt.show()
 
#Bar Plot in Seaborn with sns.barplot()
plt.figure(figsize=(10,5), tight_layout=True)
ax = sns.barplot(x=dataset['Name'], y=dataset['Age'], palette='pastel', ci=None)
ax.set(title='Barplot with Seaborn', xlabel='Names', ylabel='Age')
plt.xticks(rotation=90)
plt.show()
 
#Barplot grouped data by "n" variables
plt.figure(figsize=(12, 6), tight_layout=True)
ax = sns.barplot(x=dataset['Age'], y=dataset['Height'], hue=dataset['Name'], palette='pastel')
ax.set(title='Age vs Height' ,xlabel='Age', ylabel='Height')
ax.legend(title='Names', title_fontsize='13', loc='upper right')
plt.show()
 
#Histograms with plt.hist() or sns.histplot()
plt.figure(figsize=(10,6), tight_layout=True)
bins = [40, 50, 60, 70, 80]
# matplotlib
plt.hist(dataset['Height'], bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
plt.title('Histogram')
plt.xlabel('Height (cm)')
plt.ylabel('Count')
# seaborn
ax = sns.histplot(data=dataset, x='Height', bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
ax.set(title='Histogram', xlabel='Height (cm)', ylabel='Count')
plt.show()
 
#Boxplot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.boxplot(data=dataset, x='Name', y='Age', palette='Set2', linewidth=2.5)
ax.set(title='Boxplot', xlabel='Names', ylabel='Age (Years)')
plt.xticks(rotation=90)
plt.show()
 
#Scatter plot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.scatterplot(data=dataset, x='Age', y='Height',   hue='Name', palette='Set2', s=60)
ax.set(xlabel='Age (Years)', ylabel='Height (cms)')
ax.legend(title='People', title_fontsize = 12)
plt.show()
 
#Something else
pivot = dataset.groupby(['Name'], as_index=False).mean()
relationship = pivot.loc[:,"Age":"Height"]
print(relationship)
 
#Plot some graph
charts = ["bar", "line", "barh", "hist", "box", "kde", "density", "area"]
for chart_type in charts:
    relationship.plot(kind="%s" % chart_type) #Replace bar with line, barh, hist, box, kde, density, area
    plt.title("%s plot" % chart_type)
    plt.show()
 
#Seaborn
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = "Age", y = "Height")
plt.show()
 
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = "Height", y = "Age")
plt.show()
 
#replot
plt.figure()
sns.set_theme(style="darkgrid")
sns.relplot(x="Age", y="Height", hue="Name", data=dataset);
plt.show()
 
#Hexbin
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 15
plt.title('Hexbin')
plt.hexbin(dataset["Age"], dataset["Height"], gridsize=nbins, color=colors[:5])
plt.show()
 
#Hexbin 2
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 15
plt.title('Hexbin')
plt.hexbin(dataset["Age"], dataset["Height"], gridsize=nbins, cmap=plt.cm.BuGn_r)
plt.show()
 
#2-D Hist
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset["Age"], dataset["Height"], bins=nbins, color=colors[:6])
plt.show()
 
#2-D Hist 2
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset["Age"], dataset["Height"], bins=nbins, cmap=plt.cm.BuGn_r)
plt.show()
 
#Set variables
x = dataset["Age"]
y = dataset["Height"]
z = dataset["Name"]
 
#Linear Regression
plt.figure()
sns.regplot(x = x, y = y, data=dataset);
plt.show()
 
plt.figure()
sns.jointplot(x=x, y=y, data=dataset, kind="reg");
plt.show()
 
#Set seaborn style
sns.set_style("white")
 
# Basic 2D density plot
plt.figure()
sns.kdeplot(x=x, y=y)
plt.show()
 
# Custom the color, add shade and bandwidth
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Reds", shade=True, bw_adjust=.5)
plt.show()
 
# Add thresh parameter
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Blues", shade=True, thresh=0)
plt.show()
 
#Joint plot
plt.figure()
sns.jointplot(x = x,y = y,data = dataset,kind = 'hex')
plt.show()

#6

Combining Data Collected in Various Ways

Combine Data into one Master_Sheet Spreadsheet:

Code: [Select all] [Expand/Collapse]

import gspread
import urllib.request
urllib.request.urlretrieve("https://www.dropbox.com/s/m728v370159b2xm/credentials.json?dl=1", "credentials.json")
 
gc = gspread.service_account(filename="credentials.json")
#We can open spreadsheets by Name, or by url or by key (.open(), open_by_url(), open_by_key())
#Let's open our spreadsheets by their keys
wb1 = gc.open_by_key("1BC48PKPZW71AC6hOn1SNesvZk1PoBjE4wXwl1YlWFpY")
wb2 = gc.open_by_key("1UnkRYcOhLFMgyT_CzByvupvdaD5cL5b_nCcOeoy1uy8")
wb3 = gc.open_by_key("1E2WYRF9nAxtmUW57Ng05ZsH_0R5UZ0BLLaLcAy9tJzM")
 
#Let's rename some the headers in workbook 1
ws1 = wb1.sheet1
import numpy as np
array = np.array([["Name", "Age", "Height", "Date"]])
# Write the array to worksheet starting from the cell B1
ws1.update('B1', array.tolist())
 
#Let's create a new sheet named Master_sheet under the spreadsheet ODK Test Spreadsheet
#newsheet = wb1.add_worksheet(title="Master_Sheet", rows="1024", cols="20")
 
#We can create under the ODK Test Spreadsheet Google sheet copies of the Google form and manually collected data
#newsheet = wb1.add_worksheet(title="From_Gsurvey_form", rows="1024", cols="20")
#newsheet = wb1.add_worksheet(title="Collected_Manually", rows="1024", cols="20")
 
#Let's copy data from the two sheets in different spreadsheets to the two newly created sheets under ODK Test Spreadsheet
ws2 = wb2.sheet1
ws3 = wb3.sheet1
 
sheet1 = ws1.get_all_values()
sheet2 = ws2.get_all_values()
sheet3 = ws3.get_all_values()
 
#Dataframes
import pandas as pd
df1 = pd.DataFrame.from_records(sheet1[1:], columns=sheet1[0])
df2 = pd.DataFrame.from_records(sheet2[1:], columns=sheet2[0])
df3 = pd.DataFrame.from_records(sheet3[1:], columns=sheet3[0])
 
all_sheets = pd.concat([df1, df2, df3])
 
#Drop all NANs
clean_df = all_sheets.fillna("") 
#Let's create a list of lists for the dataframes including columns
combined_Data = [clean_df.columns.tolist()] + clean_df.to_numpy().tolist()
 
mastersheet = wb1.worksheet("Master_Sheet")
 
#Let's update the Master Sheet, start pasting from cell A1
mastersheet.update("A1", combined_Data, value_input_option="USER_ENTERED")
 
import os
os.remove("credentials.json")

Data extracted from paper Questionnaire

Data collected via Google Survey form

Data Collected using ODK Collect

The combined dataset is contained in the Master_Sheet of the spreadsheet above.

Visualize the Combined Data:

Code: [Select all] [Expand/Collapse]

#Plot some graph
#Import required libraries
import gspread
import urllib.request
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
 
#Let's Read Data from Google Sheets into Pandas without the Google Sheets API
#Useful for multiple sheets
sheet_id = "1BC48PKPZW71AC6hOn1SNesvZk1PoBjE4wXwl1YlWFpY"
sheet_name = "Master_Sheet"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
 
#Get Pandas dataframe
dataset = pd.read_csv(url)
 
#worksheet = sh.sheet1
#age = worksheet.col_values(3)[1:]
age = dataset["Age"]
print("Ages:", age)
#Height = worksheet.col_values(4)[1:]
Height = dataset["Height"]
print("Heights:", Height)
 
#Pandas is extremely very useful for Google spreadsheets
#Convert the json to Pandas dataframe
#Get all data records as dictionary
#data = worksheet.get_all_records()
#df = pd.DataFrame.from_dict(data)
 
#Let's get some statistics
#age_arr = np.array(age)
#age_array = age_arr.astype(float)
#h_arr = np.array(Height)
#h_array = h_arr.astype(float)
 
print("Average Age:", np.mean(age))
print("Mean Height:", np.mean(Height))
print("Minimum and Maximum Age:", np.min(age), np.max(age))
print("Minimum and Maximum Height:", np.min(Height), np.max(Height))
 
#Let's visualize
#Graph styles and font size
sns.set_style('darkgrid') # darkgrid, white grid, dark, white and ticks
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes
 
#sns list of color plettes
#print(sns.color_palette('deep'), sns.color_palette("pastel"), sns.color_palette("Set2"))
#If your file only has one sheet, replace sheet_url
#sheet_url = “https://docs.google.com/spreadsheets/d/1XqOtPkiE_Q0dfGSoyxrH730RkwrTczcRbDeJJpqRByQ/edit#gid=0"
#url_1 = sheet_url.replace(‘/edit#gid=’, ‘/export?format=csv&gid=’)
 
#print(dataset)
 
#Names = worksheet.col_values(2)[1:]
#Names = data["Name"]
#print(Names)
 
df_names = dataset["Name"]
df_ages = dataset["Age"]
df_Heights = dataset["Height"]
print(df_names)
 
#Preprocessing
plots = dataset.groupby(['Name'], as_index=False).mean()
#print(plots)
 
#Bar Plot in MatplotLib with plt.bar()
#Names vs Age
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('pastel')
plt.bar(dataset['Name'], dataset['Age'], color=colors[:5])
plt.xlabel('Name')
plt.xticks(rotation=90)
plt.ylabel('Age')
plt.title('Barplot')
plt.show()
 
#Name Vs Height
plt.figure()
plt.figure(figsize=(10,5), tight_layout=True)
colors = sns.color_palette('deep')
plt.bar(dataset['Name'], dataset['Height'], color=colors[:6])
plt.xlabel('Name')
plt.xticks(rotation=90)
plt.ylabel('Height')
plt.title('Barplot')
plt.show()
 
#Bar Plot in Seaborn with sns.barplot()
plt.figure(figsize=(10,5), tight_layout=True)
ax = sns.barplot(x=dataset['Name'], y=dataset['Age'], palette='pastel', ci=None)
ax.set(title='Barplot with Seaborn', xlabel='Names', ylabel='Age')
plt.xticks(rotation=90)
plt.show()
 
#Barplot grouped data by "n" variables
plt.figure(figsize=(12, 6), tight_layout=True)
ax = sns.barplot(x=dataset['Age'], y=dataset['Height'], hue=dataset['Name'], palette='pastel')
ax.set(title='Age vs Height' ,xlabel='Age', ylabel='Height')
ax.legend(title='Names', title_fontsize='13', loc='upper right')
plt.show()
 
#Histograms with plt.hist() or sns.histplot()
plt.figure(figsize=(10,6), tight_layout=True)
bins = [40, 50, 60, 70, 80]
# matplotlib
plt.hist(dataset['Height'], bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
plt.title('Histogram')
plt.xlabel('Height (cm)')
plt.ylabel('Count')
# seaborn
ax = sns.histplot(data=dataset, x='Height', bins=bins, color=sns.color_palette('Set2')[2], linewidth=2)
ax.set(title='Histogram', xlabel='Height (cm)', ylabel='Count')
plt.show()
 
#Boxplot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.boxplot(data=dataset, x='Name', y='Age', palette='Set2', linewidth=2.5)
ax.set(title='Boxplot', xlabel='Names', ylabel='Age (Years)')
plt.xticks(rotation=90)
plt.show()
 
#Scatter plot
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.scatterplot(data=dataset, x='Age', y='Height',   hue='Name', palette='Set2', s=60)
ax.set(xlabel='Age (Years)', ylabel='Height (cms)')
ax.legend(title='People', title_fontsize = 12)
plt.show()
 
#Something else
pivot = dataset.groupby(['Name'], as_index=False).mean()
relationship = pivot.loc[:,"Age":"Height"]
print(relationship)
 
#Plot some graph
charts = ["bar", "line", "barh", "hist", "box", "kde", "density", "area"]
for chart_type in charts:
    relationship.plot(kind="%s" % chart_type) #Replace bar with line, barh, hist, box, kde, density, area
    plt.title("%s plot" % chart_type)
    plt.show()
 
#Seaborn
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = "Age", y = "Height")
plt.show()
 
plt.figure()
sns.set_style("darkgrid")
sns.lineplot(data = dataset, x = "Height", y = "Age")
plt.show()
 
#replot
plt.figure()
sns.set_theme(style="darkgrid")
sns.relplot(x="Age", y="Height", hue="Name", data=dataset);
plt.show()
 
#Hexbin
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 15
plt.title('Hexbin')
plt.hexbin(dataset["Age"], dataset["Height"], gridsize=nbins, color=colors[:5])
plt.show()
 
#Hexbin 2
#Split the plotting window into 20 hexbins
plt.figure()
nbins = 15
plt.title('Hexbin')
plt.hexbin(dataset["Age"], dataset["Height"], gridsize=nbins, cmap=plt.cm.BuGn_r)
plt.show()
 
#2-D Hist
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset["Age"], dataset["Height"], bins=nbins, color=colors[:6])
plt.show()
 
#2-D Hist 2
plt.figure()
plt.title('2-D Histogram')
plt.hist2d(dataset["Age"], dataset["Height"], bins=nbins, cmap=plt.cm.BuGn_r)
plt.show()
 
#Set variables
x = dataset["Age"]
y = dataset["Height"]
z = dataset["Name"]
 
#Linear Regression
plt.figure()
sns.regplot(x = x, y = y, data=dataset);
plt.show()
 
plt.figure()
sns.jointplot(x=x, y=y, data=dataset, kind="reg");
plt.show()
 
#Set seaborn style
sns.set_style("white")
 
# Basic 2D density plot
plt.figure()
sns.kdeplot(x=x, y=y)
plt.show()
 
# Custom the color, add shade and bandwidth
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Reds", shade=True, bw_adjust=.5)
plt.show()
 
# Add thresh parameter
plt.figure()
sns.kdeplot(x=x, y=y, cmap="Blues", shade=True, thresh=0)
plt.show()
 
#Joint plot
plt.figure()
sns.jointplot(x = x,y = y,data = dataset,kind = 'hex')
plt.show()

#7

Further: Here the data in Sheet1 and Sheet2 have been combined by column names to a data in the Master_Sheet:

Read data from several Google sheets/worksheets and create Pandas dataframe for each.
Combine the dataframes created in the previous step.
Clean data to remove unrequired values.
Store the data from the combined dataframe into a new Google sheet -- Master Sheet.

In contrast to the previous topics, we open our spreadsheet by URL:

Code: [Select all] [Expand/Collapse]

import gspread
import urllib.request
urllib.request.urlretrieve("https://www.dropbox.com/s/m728v370159b2xm/credentials.json?dl=1", "credentials.json")
 
gc = gspread.service_account(filename="credentials.json")
wb = gc.open_by_url("https://docs.google.com/spreadsheets/d/1O19ketp1DpZwUytV_WqAlxafOBfwiCHiR-0F5eTir6k/edit#gid=0") #Open spreadsheet,
 
#print(wb.worksheets()) #Try wb.worksheets
#print(wb.title)
import pandas as pd
list_of_dataframes = []
for ws in wb.worksheets():
    if ws.title != "Master_Sheet":
        rows = ws.get_all_values()
        #Create data frames
        df = pd.DataFrame.from_records(rows[1:], columns=rows[0]) #Omit the first/header row, pass columns
        list_of_dataframes.append(df)
 
#Print the first dataframe
#print(list_of_dataframes[0])
 
combined_dataframes = pd.concat(list_of_dataframes)
print(combined_dataframes)
#Do some cleaning, replace NAN with blank string
clean_df = combined_dataframes.fillna("") #Try combined_dataframes.dropna("")
 
#Let's create a list of lists for the dataframes including columns
combined_Data = [clean_df.columns.tolist()] + clean_df.to_numpy().tolist()
 
mastersheet = wb.worksheet("Master_Sheet")
 
#Let's update the Master Sheet, start pasting from cell A1
mastersheet.update("A1", combined_Data, value_input_option="USER_ENTERED")
 
import os
os.remove("credentials.json")

#8

Data Cleaning

The Python snippet below performs the following tasks:

Read data from the spreadsheet by using spreadsheet ID, create a worksheet, and then create a Pandas DataFrame (df1) from the worksheet.
Preprocesses data by creating the second Pandas DataFrame (df2) for which all strings named HIV/AIDS are removed from the column named "Ailment cured", similarly, it renames all strings named "Gonorrhoea, syphilis" to "Gonorrhoea & Syphilis". Note that Python is case-sensitive.
It creates a new spreadsheet and names it "A New Test Spreadsheet". It also creates a worksheet (worksheet2) within the newly created spreadsheet.
It updates worksheet2 by copying the Pandas DataFrame df2 into it.
Finally, it sends an updated spreadsheet to an email.

Code: [Select all] [Expand/Collapse]

import gspread
import urllib.request
urllib.request.urlretrieve("https://www.dropbox.com/s/m728v370159b2xm/credentials.json?dl=1", "credentials.json")
 
gc = gspread.service_account(filename="credentials.json")
sh = gc.open_by_key("1pm1mGdRgpitrYQiGqUNSHPdR43e-ZSXCavYr-TcqtwU") #Open spreadsheet by ID (please change ID)
worksheet = sh.sheet1
 
import pandas as pd
 
df1 = pd.DataFrame(worksheet.get_all_records())
#Preprocess Data
df2 = df1[~df1['Ailment cured'].isin(['HIV/AIDS'])]
#df['Ailment cured'] = df['Ailment cured'].replace({'Gonorrhoea, syphilis':'Gonorrhoea & Syphilis'})
df2["Ailment cured"] = df2['Ailment cured'].replace('Gonorrhoea, syphilis', 'Gonorrhoea & Syphilis')
print(df2)
 
#Let's create a new blank spreadsheet:
 
sh2 = gc.create('A New Test Spreadsheet Created During Webinar')
worksheet2 = sh2.sheet1
 
#Let's write df2 to a new worksheet
worksheet2.update([df2.columns.values.tolist()] + df2.values.tolist())
#Share the new worksheet to an email:
sh2.share('ey@tssfl.co', perm_type='user', role='writer')
 
#Finally,
import os
os.remove("credentials.json")
#We can also combine/concatenate sheets

#9

Report Generation - An example of Sales Data

Code: [Select all] [Expand/Collapse]

#Import the required libraries
from __future__ import print_function
import pandas as pd
import numpy as np
import urllib.request
#print("This is Pandas version:", pd.__version__)
 
#Download and read the data into DataFrame
df = urllib.request.urlretrieve("https://www.tssfl.com/download/file.php?id=1188", "sales-sheet.xlsx")
df = pd.read_excel("./sales-sheet.xlsx")
#print(df)
#print(df.head())
 
#Pivot the data to summarize.
sales_report = pd.pivot_table(df, index=["Manager", "Rep", "Product"], values=["Price", "Quantity"],
                           aggfunc=[np.sum, np.mean], fill_value=0)
#print(sales_report.head())
 
#Show some statistics, for example, the average quantity and price for CPU and Software sales
print(df[df["Product"]=="CPU"]["Quantity"].mean())
print(df[df["Product"]=="CPU"]["Price"].mean())
print(df[df["Product"]=="Software"]["Quantity"].mean())
print(df[df["Product"]=="Software"]["Price"].mean())
 
#Let's create a Jinja environment and get the report.html template
 
from jinja2 import Environment, FileSystemLoader
env = Environment(loader=FileSystemLoader('.'))
urllib.request.urlretrieve("https://www.dropbox.com/s/djpii5trdesfb4x/report.html?dl=1", "report.html")
template = env.get_template("./report.html")
 
"""env variable above shows how we pass content to the template
We create a dictionary called temp_variables that
contains all the variable we want to pass to the template"""
 
temp_variables = {"title" : "Sales Report",
                 "pivot_table": sales_report.to_html()}
 
#Finally, let's render the HTML with the variables included in the output
#This will create a string that we will eventually pass to our PDF creation engine - WeasyPrint
 
html_output = template.render(temp_variables)
 
#Generate PDF
#We create a pdf by passing string to the PDF generator
from weasyprint import HTML
HTML(string=html_output).write_pdf("report1.pdf")
 
#Apply stylesheet
ss = urllib.request.urlretrieve("https://www.dropbox.com/s/xd7kk9t17sjfwrr/style.css?dl=1", "style.css")
HTML(string=html_output).write_pdf("report2.pdf", stylesheets=["./style.css"])
 
import os
os.remove("sales-sheet.xlsx")
os.remove("report.html")
os.remove("style.css")

#10

Some data and codes that were used during webinar are not shown here.

TSSFL TECHNOLOGY STACK

First Public Service Webinar: Accelerated Research Solutions with Digital Technologies

Who is online

First Public Service Webinar: Accelerated Research Solutions with Digital Technologies

Who is online

Login • Register