Big Data training infrastructure: local Proxmox cluster for ETL and Spark SQL
DOI: https://doi.org/10.5281/zenodo.17195759
Keywords: Big Data, Hadoop, Spark, Proxmox, teaching infrastructure, reproducibility, median, IQR, ETL, Spark SQL, educational indicators.
Abstract: The article proposes a methodologically grounded approach to deploying and pedagogically validating a local Big Data teaching infrastructure based on Proxmox with a Hadoop/Spark cluster for laboratory work in data-processing courses. Unlike the author’s previous publication, which described the architecture of the virtualized environment and the organization of access, this study focuses on a minimal reproducible experimental protocol that directly links technical performance metrics with clear educational indicators of instructional quality. The proposed protocol includes a standardized sequence of tasks (ETL → Spark SQL → stage analysis), version pinning of components (operating system, JDK, Hadoop, Spark), controlled input data, and unified student instructions. To aggregate results, we employ robust statistics (median and interquartile range, IQR) as well as stage execution profiles, which reduces the impact of outliers and keeps measurements interpretable within a single class session. Pedagogical validation is carried out through operationalized indicators: predictability of class timing (the share of the group that completes the work within the allotted slot), transparency of artifacts (the ability to verify progress via logs, notebooks, and reports), the number of technical incidents, and the perceived clarity of instructions, measured by a short survey. We also examine an education-oriented framework for comparing local infrastructure with the cloud: total course costs, stability and controllability of task execution, dependence on external services, support requirements, and accessibility for students with varying levels of preparation. Empirical results show that a local Proxmox-based cluster provides better controllability and more stable execution times for typical tasks without sacrificing technical representativeness, which is essential for planning and assessing learning activities.
The practical contribution lies in formalizing a reproducible minimum of Big Data experiments for in-class and blended formats, aligning technical metrics with educational indicators, and providing instructions suitable for course scaling and cross-course comparisons. The study’s limitations concern the cluster size and the set of tasks; future work includes automating metric collection, expanding the data corpus, and validating the approach across different academic programs.
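The aggregation described in the abstract (median, IQR, and the share of the group finishing within the allotted slot) can be illustrated with a minimal Python sketch. This is not the authors' tooling. The function name `summarize_durations`, the `slot_s` parameter, the choice of the "inclusive" quartile method, and the sample durations are all assumptions made up for illustration, not measurements from the study.

```python
from statistics import median, quantiles


def summarize_durations(durations_s, slot_s):
    """Summarize per-student task durations (seconds) with robust statistics.

    Hypothetical helper: returns the median, the quartiles, the IQR, and the
    share of students who finished within the allotted slot.
    """
    if len(durations_s) < 2:
        raise ValueError("need at least two durations to compute quartiles")
    # Quartiles via linear interpolation over the observed data ("inclusive").
    q1, _, q3 = quantiles(durations_s, n=4, method="inclusive")
    on_time = sum(1 for d in durations_s if d <= slot_s)
    return {
        "median": median(durations_s),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "on_time_share": on_time / len(durations_s),
    }


# Made-up durations for one lab task; the last value mimics a technical incident.
sample = [610, 640, 655, 670, 700, 720, 1500]
summary = summarize_durations(sample, slot_s=900)
print(summary)
```

With these illustrative values the single slow run raises the arithmetic mean to 785 s, while the median stays at 670 s. This is the kind of outlier resistance the abstract attributes to robust statistics.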
Published
2025-09-24
How to Cite
Sitsylitsyn, Y. O., & Lubko, D. V. (2025). Big Data training infrastructure: local Proxmox cluster for ETL and Spark SQL. Pedagogical Academy: Scientific Notes, (22). https://doi.org/10.5281/zenodo.17195759
Issue
Section
Information and communication technologies in education
License
Copyright (c) 2025 Юрій Олександрович Сіциліцин, Дмитро Вікторович Лубко

This work is licensed under a Creative Commons Attribution 4.0 International License.