UNCW MS Computer Science Information Systems Proceedings
A Data Pipeline for Amazon Review Collection and Preparation
Ryan Woodall
Douglas Kline (Chair)
Ron Vetter
Minoo Modaresnezhad
Abstract
Modern companies struggle with big data collection and processing, and it has become best practice to handle both with data pipelines in the cloud. Knowledge of the tools for building a data pipeline that is maintainable, adaptable, repeatable, and scalable would be a valuable asset to any aspiring data engineer. Using Azure tools, I can accomplish this task, with data from Amazon reviews as an example. Azure Data Factory provides an intuitive environment for integrating data visually and constructing ETL and ELT processes. My motivation for this project is to learn and explore these technologies. Using the problem of Amazon reviews, I can demonstrate the design and development of a data pipeline while supplying a data science effort with large amounts of data. The advantages of this approach are that it is repeatable, easily modifiable, and scalable, which is not the case for typical manual processes. Additionally, I expect it to produce datasets that others will want to use in future research.
Recommended Citation: Woodall R., Kline D., Vetter R., Modaresnezhad M. (2021). A Data Pipeline for Amazon Review Collection and Preparation. UNCW MS CSIS Proceedings. V. 15, N. 3.
This article was accepted for publication/presentation:
2021 Proceedings of the Conference on Information Systems Applied Research, Washington DC