V15 N1 Paper 3 | Annals of the MS in Computer Science and Information Systems at UNC Wilmington | Spring 2021
A Data Pipeline for Amazon Review Collection and Preparation
Ryan Woodall
Committee
Abstract
Modern companies struggle to collect and process big data, and building data pipelines in the cloud has become best practice for this work. Knowledge of the tools for building a data pipeline that is maintainable, adaptable, repeatable, and scalable would be a valuable asset to any aspiring data engineer. This project uses Azure tools to build such a pipeline, with Amazon review data as the example. Azure Data Factory provides an intuitive environment for integrating data visually and constructing ETL and ELT processes. My motivation for this project is to learn and explore these technologies. Using the problem of Amazon reviews, I can demonstrate the design and development of a data pipeline while supplying a data science effort with large amounts of data. The advantages of this approach are that it is repeatable, easily modifiable, and scalable, which is not the case for typical manual processes. Additionally, I expect it to produce datasets that other researchers will want to use in future work.
Recommended Citation:
Woodall, R., Kline, D., Vetter, R., Modaresnezhad, M. (2021). A Data Pipeline for Amazon Review Collection and Preparation. Annals of the Master of Science in Computer Science and Information Systems at UNC Wilmington, 15(1), paper 3. http://csbapp.uncw.edu/data/mscsis/full.aspx.