python - PySpark: dynamic union of DataFrames with different columns

Consider the arrays shown below. I have three arrays:

Array 1:

C1  C2  C3
1   2   3
9   5   6

Array 2:

C2 C3 C4
11 12 13
10 15 16

Array 3:

C1   C4
111  112
110  115

I need the output shown below. Each input may contain only some of the columns C1, ..., C4, but when combining them the values must land under the correct columns, and any missing column should be filled with zero.

Expected output:

C1   C2  C3  C4
1    2   3   0
9    5   6   0
0    11  12  13
0    10  15  16
111  0   0   112
110  0   0   115

I have written the PySpark code below, but the values for the new columns are hardcoded and the approach is crude. I need to turn this into something like method overloading, so the script can handle any combination of input columns automatically. I must use only Python/PySpark, not pandas.

import pyspark
from pyspark.sql.functions import lit

sqlContext = pyspark.SQLContext(pyspark.SparkContext())

df01 = sqlContext.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = sqlContext.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = sqlContext.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))

# The missing columns are hardcoded per DataFrame, which is what I want to avoid.
df01_add = df01.withColumn("C4", lit(0)).select("C1", "C2", "C3", "C4")
df02_add = df02.withColumn("C1", lit(0)).select("C1", "C2", "C3", "C4")
df03_add = df03.withColumn("C2", lit(0)).withColumn("C3", lit(0)).select("C1", "C2", "C3", "C4")

df_uni = df01_add.union(df02_add).union(df03_add)
df_uni.show()

Method Overloading Example:

class Student:
    def __init__(self, m1=None, m2=None):
        self.m1 = m1
        self.m2 = m2

    def sum(self, c1=None, c2=None, c3=None, c4=None):
        # Optional arguments stand in for overloaded signatures: add whichever values were passed.
        if c1 is not None and c2 is not None and c3 is not None:
            return c1 + c2 + c3
        elif c1 is not None and c2 is not None:
            return c1 + c2
        return c1

s1 = Student()
print(s1.sum(55, 65, 23))  # 143
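
Note that Python has no true method overloading by signature; the usual substitute is a single method with default or variable-length arguments, which is roughly what the sum method above already does. A minimal sketch of the same idea with *args (the Student name is reused here only for illustration):

class Student:
    def sum(self, *values):
        # One method accepts any number of arguments; None values are simply skipped.
        return sum(v for v in values if v is not None)

s1 = Student()
print(s1.sum(55, 65, 23))  # 143
print(s1.sum(55, 65))      # 120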

1 Reply


There are probably plenty of better ways to do it, but maybe the code below will be useful to someone in the future.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .appName("DynamicFrame") \
    .getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11,12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111,112), (110, 115)], ("C1", "C4"))

dataframes = [df01, df02, df03]

# Create a list of all the column names and sort them
cols = set()
for df in dataframes:
    for x in df.columns:
        cols.add(x)
cols = sorted(cols)

# Create a dictionary with all the dataframes
dfs = {}
for i, d in enumerate(dataframes):
    new_name = 'df' + str(i)  # New name for the key, the dataframe is the value
    dfs[new_name] = d
    # Loop through all column names. Add the missing columns to the dataframe (with value 0)
    for x in cols:
        if x not in d.columns:
            dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
    dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted

# Now put it all together with a loop (union)
result = dfs['df0']            # Take the first dataframe, add the others to it
dfs_to_add = list(dfs.keys())  # List of all the dataframe keys in the dictionary
dfs_to_add.remove('df0')       # Remove the first one, because it is already in the result
for x in dfs_to_add:
    result = result.union(dfs[x])
result.show()

Output:

+---+---+---+---+
| C1| C2| C3| C4|
+---+---+---+---+
|  1|  2|  3|  0|
|  9|  5|  6|  0|
|  0| 11| 12| 13|
|  0| 10| 15| 16|
|111|  0|  0|112|
|110|  0|  0|115|
+---+---+---+---+
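
As a side note, on Spark 3.1 or later the same result can be built more compactly with unionByName(allowMissingColumns=True), which matches columns by name and adds the absent ones as null; a fillna(0) then turns those nulls into the zeros expected above. A minimal sketch under that assumption (Spark >= 3.1):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DynamicFrame") \
    .getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))

dataframes = [df01, df02, df03]

# unionByName aligns columns by name; allowMissingColumns (Spark 3.1+) adds the
# missing columns as null, and fillna(0) replaces those nulls with zeros.
result = reduce(
    lambda a, b: a.unionByName(b, allowMissingColumns=True),
    dataframes
).fillna(0)

result.select(sorted(result.columns)).show()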
