google-cloud-dataproc questions


I am trying to create a cluster in Dataproc using the google-cloud-python library; however, when setting region = 'us-central1' I get the exception below: google.api_core.exceptions.InvalidArgument: 400 ...
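
A minimal sketch of what may be going on, assuming a recent google-cloud-dataproc client: when a non-global region is used, the client also has to be pointed at that region's endpoint, otherwise the API rejects the request with a 400. Project and cluster names below are placeholders, and the exact create_cluster signature varies between library versions.

    from google.cloud import dataproc_v1

    region = "us-central1"
    project_id = "my-project"  # placeholder

    # Point the client at the regional endpoint that matches `region`.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "{}-dataproc.googleapis.com:443".format(region)}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": "example-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }

    operation = client.create_cluster(project_id, region, cluster)
    operation.result()  # block until the cluster is created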

Here I am writing queries in queryList, which is under hiveJob, to submit a Hive job to the Dataproc cluster: def submit_hive_job(dataproc, project, region, cluster_name): ...
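
For context, a hedged sketch of how such a helper is usually shaped with the Python client; the queries and the exact submit_job signature are assumptions that depend on the library version.

    from google.cloud import dataproc_v1

    def submit_hive_job(dataproc, project, region, cluster_name):
        # dataproc is assumed to be a dataproc_v1.JobControllerClient
        job = {
            "placement": {"cluster_name": cluster_name},
            "hive_job": {
                "query_list": {"queries": ["SHOW DATABASES;", "SHOW TABLES;"]}
            },
        }
        result = dataproc.submit_job(project, region, job)
        return result.reference.job_id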

Google Dataproc single-node cluster, VCores Total = 8. I've tried, as the spark user: /usr/lib/spark/sbin/start-thriftserver.sh --num-executors 2 --executor-cores 4, and tried to change /usr/lib/spark/conf/...

I am working on a Dataproc Spark cluster with an initialization action to install Jupyter notebook. I am unable to read the CSV files stored in the Google Cloud Storage bucket; however, I am able to ...
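
As a point of comparison, a minimal PySpark sketch for reading a CSV from GCS, assuming a placeholder bucket/path and the GCS connector that Dataproc ships by default:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # gs:// paths work out of the box on Dataproc via the GCS connector.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("gs://my-bucket/path/to/file.csv"))
    df.show(5)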

My jobs (ML jobs) require more than 15 GB RAM per worker. How do I change the machine type for the workers? Currently: n1-standard-4 (4 vCPU, 15.0 GB memory). I would prefer to keep my cluster not ...
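
A hedged sketch of the usual route: worker machine types cannot be changed in place, so one typically creates a new cluster with a higher-memory type such as n1-highmem-4 (4 vCPUs, 26 GB). Project, cluster, and region names are placeholders, and the call signature depends on the client version.

    from google.cloud import dataproc_v1

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",
        "cluster_name": "highmem-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            # Higher-memory workers for memory-hungry ML jobs.
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-highmem-4"},
        },
    }

    client.create_cluster("my-project", "us-central1", cluster).result()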

I have a GCP Dataproc cluster where I'm running a job. The input of the job is a folder containing 200 part files, each approximately 1.2 GB in size. My job is just map operations: val df = ...

What port should I use to access the Spark UI on Google Dataproc? I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln. The firewall is properly configured.

The command below works fine in my terminal: gcloud logging read "logName=projects/logs/java.log AND labels.component=projet1 AND textPayload=\"End: of query.\"" but it returns a null InputStream when ...

I am getting this error when running the Spotify Spark BigQuery connector on the Qubole data platform. I do see the BigQueryUtils class in my jar, but it still throws this error: Exception in thread "...

I am trying to run a PySpark job using Dataproc. The only difference compared to all the examples out there is that I want to submit the job from an .egg instead of a .py file. In order to submit ...
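
A hedged sketch of one common pattern, assuming placeholder paths: an .egg cannot be the main file itself, so a thin driver .py imports the packaged code and the .egg is shipped alongside it via python_file_uris (the --py-files equivalent).

    from google.cloud import dataproc_v1

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {
            # Small driver script that imports and calls into the .egg package.
            "main_python_file_uri": "gs://my-bucket/jobs/main.py",
            # The packaged application, distributed to the executors.
            "python_file_uris": ["gs://my-bucket/jobs/my_package-0.1.egg"],
        },
    }

    client = dataproc_v1.JobControllerClient()
    client.submit_job("my-project", "global", job)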

I have created and successfully tested a project which runs locally with docker-compose using sample data. The Bash file to run the whole pipeline job: cp -r ../data . # transfer data used for job ...

This is my current Hadoop job. java -cp `hadoop classpath`:/usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \ --python_venv=/usr/local/src/jobs/MyJob/mnist_venv....

How can we visualize (via dashboards) the Dataproc job status in Google Cloud Platform? We want to check whether jobs are running or not, in addition to their status, such as running, delayed, or blocked. On top of ...

I am desperately trying to make a simple program to load data from BigQuery into a Spark dataframe. Google's Dataproc pyspark example doesn't work; I also followed these links: BigQuery ...
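
A minimal sketch with the newer spark-bigquery connector, assuming its jar is available on the cluster (e.g. passed via --jars or an initialization action) and a placeholder table name; this is an alternative to the Hadoop-InputFormat approach in Google's example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-to-dataframe").getOrCreate()

    # Reads the BigQuery table directly into a Spark DataFrame.
    df = (spark.read.format("bigquery")
          .option("table", "my-project.my_dataset.my_table")
          .load())
    df.printSchema()
    df.show(5)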

Is it possible to keep the master machine running in Dataproc? Every time I run the job, after a while (~1 hour) I see that the master node is stopped. It is not a real issue since I would easily ...

I have used the "Use the BigQuery connector with Spark" guide to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there: conf = { # Input ...
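
For reference, a hedged reconstruction of the kind of configuration that guide uses for the Hadoop BigQuery connector, with the public shakespeare sample table standing in for the real input; the key names follow the documented mapred.bq.* properties:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    bucket = sc._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
    project = sc._jsc.hadoopConfiguration().get("fs.gs.project.id")

    conf = {
        # Input configuration for the Hadoop BigQuery connector.
        "mapred.bq.project.id": project,
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.temp.gcs.path": "gs://{}/hadoop/tmp/bigquery/pyspark_input".format(bucket),
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record arrives as (row id, JSON string describing the row).
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )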

Is there a way of submitting HDFS commands on a Dataproc cluster if you can't SSH into the master node? I couldn't find anything in the gcloud SDK or the REST API. So something like: gcloud ...
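
One commonly suggested workaround, sketched here under the assumption that the jobs API is acceptable: submit a Pig job whose queries are "fs" shell commands, which runs HDFS commands on the cluster without SSH. Project and cluster names are placeholders.

    from google.cloud import dataproc_v1

    job = {
        "placement": {"cluster_name": "my-cluster"},
        # Pig's `fs` command shells out to the Hadoop filesystem.
        "pig_job": {"query_list": {"queries": ["fs -ls /", "fs -mkdir -p /tmp/example"]}},
    }

    dataproc_v1.JobControllerClient().submit_job("my-project", "global", job)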

I have adapted the instructions at Use the BigQuery connector with Spark to extract data from a private BigQuery object using PySpark. I am running the code on Dataproc. The object in question is a ...

I have followed "Use the BigQuery connector with Spark" to successfully get data from a publicly available dataset. I now need to access a BigQuery dataset that is owned by one of our clients and for ...

I have a table in BigQuery that I want to query and run the FP-growth algorithm on. I want to try it first in the pyspark shell, using a VM instance of the Dataproc cluster. I am looking for a way to ...
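
Independently of how the BigQuery rows are pulled in, a minimal FP-growth sketch for the pyspark shell (Spark 2.2+), using a toy DataFrame as a stand-in for the real data:

    from pyspark.ml.fpm import FPGrowth
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy transactions; replace with rows extracted from BigQuery.
    df = spark.createDataFrame(
        [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"])],
        ["id", "items"],
    )

    fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fp.fit(df)
    model.freqItemsets.show()
    model.associationRules.show()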

Google Cloud Dataproc provides initialization scripts for many frameworks, including Kafka, Zeppelin, etc., but there is no default script for Cassandra. I was wondering if there is one already scripted by ...

I have been working on a Spark cluster using the Dataproc Google Cloud service for machine learning modelling. I have been successful in loading the data from the Google Storage bucket. However, I am not ...

After a Dataproc cluster is created, many jobs are submitted automatically to the ResourceManager by user dr.who. This is starving the cluster of resources and eventually overwhelms the cluster. ...

I'm working with ephemeral GCP Dataproc clusters (Apache Spark 2.2.1, Apache Hadoop 2.8.4 and Apache Hive 2.1.1). These clusters all point to the same Hive Metastore (hosted on a Google Cloud SQL ...

I am running a Spark 2.2 job on Dataproc and I need to access a bunch of avro files located in a GCP storage bucket. To be specific, I need to access the files DIRECTLY from the bucket (i.e. NOT first ...
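
A hedged sketch for reading the Avro files straight from the bucket under Spark 2.2, assuming the external spark-avro package (e.g. com.databricks:spark-avro_2.11:4.0.0) is supplied via --packages or --jars; the path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Reads the Avro files directly from GCS, without copying them first.
    df = (spark.read
          .format("com.databricks.spark.avro")
          .load("gs://my-bucket/avro-data/*.avro"))
    df.show(5)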

I work for an organisation that needs to pull data from one of our clients' BigQuery datasets using Spark, and given that both the client and we use GCP, it makes sense to use Dataproc to achieve ...

I am performing some operations using DataProcPySparkOperator. This operator only takes a cluster name as a parameter; there is no option to specify the region, and by default it considers clusters with ...
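
A hedged sketch assuming a newer Airflow release in which the Dataproc job operators accept a region argument (older releases only target the global region); the DAG, bucket, and cluster names are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    dag = DAG("dataproc_example", start_date=datetime(2018, 1, 1), schedule_interval=None)

    run_pyspark = DataProcPySparkOperator(
        task_id="run_pyspark",
        main="gs://my-bucket/jobs/main.py",
        cluster_name="my-cluster",
        region="us-central1",  # only available in newer Airflow versions
        dag=dag,
    )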

In the Google Cloud Dataproc beta what are the versions of Spark and Hadoop? What version of Scala is Spark compiled for?

@dennis-huo "Using non-default service account in Google Cloud Dataproc": in continuation of the problem above, I wanted to set up a Dataproc cluster for multiple users. Since the Compute Engine of Dataproc ...

How can I check the number of Dataproc clusters in use at any given time in Google Cloud Platform? If possible, we need a way to visualize that in GCP as well.
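
A minimal sketch of counting clusters per region with the Python client; the project and region are placeholders, and the exact list_clusters signature depends on the library version.

    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "{}-dataproc.googleapis.com:443".format(region)}
    )

    clusters = list(client.list_clusters("my-project", region))
    print("{} clusters in {}".format(len(clusters), region))
    for cluster in clusters:
        print(cluster.cluster_name, cluster.status.state)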

I'd like to create a Dataproc cluster that runs under a non-default service account. The following works for a Compute Engine instance: gcloud compute instances create instance-1 --machine-type "n1-standard-...

We're using Spark Thrift Server as a long-running service for ad-hoc SQL queries, instead of Hive/Tez. This is working out fairly well, except that every few days it starts filling up the disk on ...

I am running a few batch Spark pipelines that consume Avro data on Google Cloud Storage. I need to update some pipelines to be more real-time and am wondering if Spark Structured Streaming can directly ...

How do I run more than one Spark streaming job in a Dataproc cluster? I created multiple queues using capacity-scheduler.xml, but now I will need 12 queues if I want to run 12 different streaming ...

I am trying to run a Structured Streaming application which writes the output files as Parquet to Google Cloud Storage. I don't see any errors, but it does not write the files to the GCS location. I could ...
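
A minimal sketch of a Parquet sink for comparison, assuming placeholder paths and a toy rate source: the two usual culprits when nothing appears are a missing checkpointLocation and not keeping the query alive.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy streaming source; replace with the real input stream.
    stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    query = (stream_df.writeStream
             .format("parquet")
             .option("path", "gs://my-bucket/output/")
             .option("checkpointLocation", "gs://my-bucket/checkpoints/")
             .outputMode("append")
             .trigger(processingTime="30 seconds")
             .start())

    # Keep the driver alive so micro-batches actually get written.
    query.awaitTermination()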

I have a GCP Dataproc cluster with 50 workers (n1-standard-16: 16 vCores, 64 GB RAM). The cluster uses the Capacity Scheduler with the Default Resource Calculator. My Spark job has the following configuration ...

I have Dataproc set up on Google Cloud Platform with Apache Livy installed. I am submitting jobs using the Livy REST API. When I try to kill Livy jobs from the YARN RM, I get the error below in the browser ...

I am using GCP/Dataproc for some Spark/GraphFrames calculations. In my private Spark/Hadoop standalone cluster, I have no issue using functools.partial when defining a pyspark UDF. But now, with GCP/...
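
A hedged workaround sketch: wrapping the partial in a plain lambda means Spark serializes an ordinary function rather than a functools.partial object, which sidesteps the usual attribute errors. The names and the toy DataFrame are placeholders.

    from functools import partial

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

    def add(x, n):
        return x + n

    add_five = partial(add, n=5)

    # Wrap the partial in a lambda so the UDF wraps a plain function.
    add_five_udf = udf(lambda x: add_five(x), IntegerType())

    df.withColumn("x_plus_5", add_five_udf("x")).show()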

I am using Google Cloud Dataflow to implement an ETL data warehouse solution. Looking into the Google Cloud offering, it seems Dataproc can also do the same thing. It also seems Dataproc is a little bit ...

I am new to BigQuery on GCP, and to access BigQuery data we are using the Spotify spark-bigquery connector as provided here. We are able to use sqlContext.bigQueryTable("project_id:dataset.table") and it's ...

According to the Dataproc docs, it has "native and automatic integrations with BigQuery". I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc ...

I am considering adding a set of preemptible instances to the worker pool of a Spark job that I run on Google Cloud Dataproc, but I am trying to understand what exactly would happen in case some of ...

I am getting an error while trying to submit a PySpark job to a Dataproc cluster. The gcloud submission command: gcloud dataproc jobs submit pyspark --cluster test-cluster migrate_db_table.py. But ...

I'm building a Spark application which will run on Dataproc. I plan to use ephemeral clusters and spin a new one up for each execution of the application. So I basically want my job to eat up as much ...

Using Spark on GCP Dataproc, I successfully write an entire RDD to GCS like so: rdd.saveAsTextFile(s"gs://$path"). The products are files for each partition in the same path. How do I write files for ...
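
A hedged PySpark sketch of the usual answer when a single output file is wanted: coalesce to one partition before saving (the same coalesce(1) call applies to the Scala RDD API), keeping in mind that this funnels all data through one task. Paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    rdd = spark.sparkContext.parallelize(["a", "b", "c"])
    # One partition in, one part file out.
    rdd.coalesce(1).saveAsTextFile("gs://my-bucket/output-single")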

Working with Dataproc, I was exploring different configurations related to Spark and YARN, and I found that Dataproc includes GC_OPTS="-XX:+UseConcMarkSweepGC" as part of the YARN env configuration. ...

I have a simple example running on a Dataproc master node where Tachyon, Spark, and Hadoop are installed. I have a replication error writing to Tachyon from Spark. Is there any way to specify it ...

I want to run Presto on a Dataproc instance, or on Google Cloud Platform in general. How can I easily set up and install Presto, especially with Hive?

I am trying to handle somewhat big data for a Kaggle competition. The amount of data to handle is about 80 GB, and it has 2 billion rows x 6 columns. The data was put in Google Cloud Storage ...

Due to some mix-up during planning we ended up with several worker nodes running 23TB drives which are now almost completely unused (we keep data on external storage). As the drives are only wasting ...
