Big Data training infrastructure: local Proxmox cluster for ETL and Spark SQL

Authors

  • Yurii Oleksandrovych Sitsylitsyn, PhD, Senior Lecturer at the Department of Informatics and Cybernetics, Bogdan Khmelnitsky Melitopol State Pedagogical University, 69000, Zaporizhzhia, Naukovogo Mistechka St., 59, Ukraine https://orcid.org/0000-0002-3888-5575
  • Dmytro Viktorovych Lubko, Candidate of Technical Sciences, Associate Professor, Department of Computer Science, Dmytro Motornyi Tavria State Agrotechnological University, 69600, Zaporizhzhia, Zhukovsky St., 66, Ukraine https://orcid.org/0000-0002-2506-4145

DOI:

https://doi.org/10.5281/zenodo.17195759

Keywords:

Big Data, Hadoop, Spark, Proxmox, teaching infrastructure, reproducibility, median, IQR, ETL, Spark SQL, educational indicators.

Abstract

The article proposes a methodologically grounded approach to deploying and pedagogically validating a local Big Data teaching infrastructure based on Proxmox with a Hadoop/Spark cluster for laboratory work in data-processing courses. Unlike the author’s previous publication, which described the architecture of the virtualized environment and the organization of access, this study focuses on a minimal reproducible experimental protocol that directly links technical performance metrics to clear educational indicators of instructional quality. The proposed protocol includes a standardized sequence of tasks (ETL → Spark SQL → stage analysis), version pinning of components (operating system, JDK, Hadoop, Spark), controlled input data, and unified student instructions. To aggregate results, we employ robust statistics, namely the median and interquartile range (IQR), together with stage execution profiles; this reduces the impact of outliers and keeps measurements interpretable within a single class session. Pedagogical validation is carried out through operationalized indicators: predictability of class timing (the share of the group that completes the work within the allotted slot), transparency of artifacts (the ability to verify progress via logs, notebooks, and reports), the number of technical incidents, and the perceived clarity of instructions, measured with a short survey. We also examine an education-oriented framework for comparing local infrastructure with the cloud: total course costs, stability and controllability of task execution, dependence on external services, support requirements, and accessibility for students with varying levels of preparation. Empirical results show that a local Proxmox-based cluster provides better controllability and more stable timing for typical tasks without sacrificing technical representativeness, which is essential for planning and assessing learning activities.
The practical contribution lies in formalizing a reproducible minimum of Big Data experiments for in-class and blended formats, aligning technical metrics with educational indicators, and providing instructions suitable for course scaling and cross-course comparisons. The study’s limitations concern the cluster size and the set of tasks; future work includes automating metric collection, expanding the data corpus, and validating the approach across different academic programs.
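
The aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' measurement code: the function names, the sample durations, and the 80-minute slot are hypothetical values chosen only to show how the median, the IQR, and the share of on-time completions might be computed from repeated runs.

```python
from statistics import median, quantiles


def summarize_durations(durations_s):
    """Return (median, IQR) for a list of run durations in seconds."""
    # Quartiles via linear interpolation over the observed range ("inclusive").
    q1, _, q3 = quantiles(durations_s, n=4, method="inclusive")
    return median(durations_s), q3 - q1


def on_time_share(completion_min, slot_min):
    """Share of students whose completion time fits in the allotted slot."""
    return sum(t <= slot_min for t in completion_min) / len(completion_min)


# Hypothetical durations of one ETL task over five runs; the last is an outlier.
etl_runs_s = [41.0, 43.0, 42.0, 44.0, 120.0]
med, iqr = summarize_durations(etl_runs_s)
print(f"ETL: median = {med} s, IQR = {iqr} s")

# Hypothetical completion times (minutes) for five students, 80-minute slot.
share = on_time_share([70, 75, 80, 85, 95], slot_min=80)
print(f"on-time share = {share:.0%}")
```

In this made-up sample the single 120-second run would pull the arithmetic mean up to 58 seconds, while the median stays at 43 seconds. This is the kind of outlier resistance the abstract attributes to robust statistics.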

Published

2025-09-24

How to Cite

Sitsylitsyn, Y. O., & Lubko, D. V. (2025). Big Data training infrastructure: local Proxmox cluster for ETL and Spark SQL. Pedagogical Academy: Scientific Notes, (22). https://doi.org/10.5281/zenodo.17195759

Section

Information and communication technologies in education