Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

developer tip

Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

optionbox 2020. 9. 23. 07:31

Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

DataFrameSpark-Scala에서 모든 헤더 / 열 이름을 변환하려고합니다 . 현재로서는 단일 열 이름 만 대체하는 다음 코드가 나옵니다.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

구조가 평평한 경우 :

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

가장 간단한 toDF방법 은 방법 을 사용 하는 것입니다.

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

개별 열의 이름을 바꾸려면 다음 select과 함께 사용할 수 있습니다 alias.

df.select($"_1".alias("x1"))

여러 열로 쉽게 일반화 할 수 있습니다.

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

또는 withColumnRenamed:

df.withColumnRenamed("_1", "x1")

와 함께 사용하여 foldLeft여러 열의 이름을 바꿉니다.

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

중첩 된 구조 ( structs)에서 가능한 옵션 중 하나는 전체 구조를 선택하여 이름을 바꾸는 것입니다.

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.first".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

nullability메타 데이터에 영향을 미칠 수 있습니다 . 또 다른 가능성은 캐스팅하여 이름을 바꾸는 것입니다.

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

또는:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

For those of you interested in PySpark version (actually it's same in Scala - see comment below) :

merchants_df_renamed = merchants_df.toDF(
    'merchant_id', 'category', 'subcategory', 'merchant')

merchants_df_renamed.printSchema()

Result:

root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case is isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.

참고URL : https://stackoverflow.com/questions/35592917/renaming-column-names-of-a-dataframe-in-spark-scala

'developer tip' 카테고리의 다른 글

파일 이름에 대한 빠른 검색 Android Studio (0)	2020.09.23
C #에서 Base64 URL 안전 인코딩을 달성하는 방법은 무엇입니까? (0)	2020.09.23
Google Maps API의 "내 위치"버튼 위치 변경 (0)	2020.09.23
C #을 포함한 문자열에서 HTML 태그 제거 (0)	2020.09.23
줄 바꿈으로 구분 된 파일을 읽고 줄 바꿈을 버리는 가장 좋은 방법은 무엇입니까? (0)	2020.09.23

현재글Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

optionbox

Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

'developer tip' 카테고리의 다른 글

'developer tip'의 다른글

티스토리툴바

Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

Spark Scala에서 DataFrame의 열 이름 이름 바꾸기

'developer tip' 카테고리의 다른 글

'developer tip'의 다른글

관련글

티스토리툴바