PySpark for Large Data Processing

Posted: Thu May 09, 2024 9:22 pm
by Eli
PySpark is the Python API for Apache Spark, an open-source distributed computing framework and set of libraries for real-time, large-scale data processing. Here is a PySpark tutorial:
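To give a feel for the API, here is a minimal sketch of a grouped aggregation. It assumes pyspark is installed and a Java runtime is available, and it runs Spark in local mode rather than on a cluster. The sample data, the column names, and the helper names (SALES, totals_with_spark, totals_plain) are made up for illustration. The plain-Python function is only there as a reference to compare the Spark result against.

```python
from collections import defaultdict

# Hypothetical sample data, invented for illustration: (region, amount) pairs.
SALES = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)]


def totals_with_spark(rows):
    """Sum amounts per region using PySpark in local mode.

    Assumes pyspark and a Java runtime are installed. The import is done
    inside the function so the rest of the file loads without them.
    """
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")          # use all local cores, no cluster needed
        .appName("forum-example")    # arbitrary application name
        .getOrCreate()
    )
    try:
        # Build a DataFrame from the Python rows, naming the two columns.
        df = spark.createDataFrame(rows, schema=["region", "amount"])
        # Group by region and sum the amounts; Spark distributes this work.
        result = df.groupBy("region").agg(F.sum("amount").alias("total"))
        # collect() brings the (small) result back to the driver as Row objects.
        return {row["region"]: row["total"] for row in result.collect()}
    finally:
        spark.stop()


def totals_plain(rows):
    """Same aggregation in plain Python, as a reference for small inputs."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)
```

Calling totals_with_spark(SALES) should give the same mapping as totals_plain(SALES). For four rows Spark is pure overhead; the point of the DataFrame version is that the same groupBy/agg code keeps working when the data no longer fits on one machine.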


Re: PySpark for Large Data Processing

Posted: Sat May 18, 2024 12:40 pm
by Eli
Related tools include DuckDB, Pandas, Polars, Dataclasses, Pydantic, and Joblib.