Essays about: "Dataprocessering"

Found 4 essays containing the word Dataprocessering.

  1. A Comparative Study on Efficiency and Scalability of Integer and String Datasets in cuDF and pandas

    University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

    Author : Anton Schulz; Emil Sjölander; [2023]

    Abstract : This thesis presents a comparative analysis of cuDF and pandas, two Python data processing libraries, with a focus on performance, limitations, and scalability when handling integer and string datasets. The study aims to assess the efficiency and suitability of cuDF as a potential alternative to pandas in scenarios where high-performance data processing is required.

  2. Highly Available Task Scheduling in Distinctly Branched Directed Acyclic Graphs

    University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

    Author : Patrik Zhong; [2023]
    Keywords : Distributed Scheduling; Fault-tolerance; Graph Partitioning; Task Graphs; Dask; Dask Distributed; Data Processing; Distribuerad Schemaläggning; Feltolerans; Grafpartitionering; Uppgiftsgrafer; Dataprocessering;

    Abstract : Big data processing frameworks that parallelize computation over datasets across distributed systems have become a staple of data engineering and data science pipelines. One of the better-known frameworks is Dask, a widely used distributed framework for parallelizing data processing jobs.

  3. Scaling cloud-native Apache Spark on Kubernetes for workloads in external storages

    University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

    Author : Piotr Mrowczynski; [2018]
    Keywords : Cloud Computing; Spark on Kubernetes; Kubernetes Operator; Elastic Resource Provisioning; Cloud-Native Architectures; Openstack Magnum; Containers; Data Mining; Spark över Kubernetes;

    Abstract : The CERN Scalable Analytics Section currently offers shared YARN clusters to its users for purposes such as monitoring, security, and experiment operations. YARN clusters with data in HDFS are difficult to provision and complex to manage and resize. This poses new data and operational challenges in meeting future physics data processing requirements.

  4. Integrating Pig and Stratosphere

    University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT)

    Author : Vasiliki Kalavri; [2012]

    Abstract : MapReduce is a widespread programming model for processing large amounts of data in parallel. PACT is a generalization of MapReduce, based on the concept of Parallelization Contracts (PACTs). Writing efficient applications in MapReduce or PACT requires strong programming skills and an in-depth understanding of the systems' architectures.