We have a JSON file and need to create a DataFrame from it. The PySpark code below reads a JSON file into a DataFrame.
Code
# The JSON reader infers the schema by default. "header" and "inferSchema" are CSV options, so they are not needed here.
Dataframe_df=spark.read.json("/FileStore/dataset/constructors.json")
Use the command below to see the schema of the DataFrame.
Code
Dataframe_df.printSchema()
Once the DataFrame is created, use the command below to view the contents that were loaded from the JSON file.
Code
display(Dataframe_df)
Use the command below to remove a column from the DataFrame.
Code
Dataframe_df_Dropped=Dataframe_df.drop('url')
After dropping the column, display the contents of the new DataFrame. The "url" column is no longer present.
Code
display(Dataframe_df_Dropped)
Use the command below to remove multiple columns from a DataFrame.
Code
Dataframe_df_Drop_multiple_columns=Dataframe_df.drop('url','name')
You can again view the structure and contents of the DataFrame after removing multiple columns.
Code
display(Dataframe_df_Drop_multiple_columns)
Use the command below to rename a column in a DataFrame.
Code
Dataframe_df_Final=Dataframe_df_Drop_multiple_columns.withColumnRenamed("constructorid","Constructor_id")
Use the command below to rename multiple columns.
Code
Dataframe_df_Final=Dataframe_df_Drop_multiple_columns.withColumnRenamed("constructorid","Constructor_id") \
.withColumnRenamed("constructorRef","Constructor_Ref") \
.withColumnRenamed("nationality","Nationality")
The code below renames multiple columns and also adds a new column containing the current timestamp.
Code
from pyspark.sql.functions import current_timestamp
Dataframe_df_Final=Dataframe_df_Drop_multiple_columns.withColumnRenamed("constructorid","Constructor_id") \
.withColumnRenamed("constructorRef","Constructor_Ref") \
.withColumnRenamed("nationality","Nationality") \
.withColumn("Ingestion_date",current_timestamp())