We often end up creating problems while working with data. So, here are a few best practices for data engineering using Snowflake:
1. Transform your data incrementally:
A common mistake novice data engineers make is writing huge SQL statements that join, aggregate, and process many tables in one pass, in the mistaken belief that this is an efficient way to work. In practice, the code becomes overly complex, difficult to maintain, and, worse, often performs badly. Instead, split the transformation pipeline into multiple steps and write the results to intermediate tables. This makes it easier to test intermediate results, simplifies the code, and often produces simpler SQL that runs faster.
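The approach can be sketched as follows; the table and column names (`orders`, `customers`, `stg_order_totals`, `fct_customer_revenue`) are hypothetical:

```sql
-- Step 1: aggregate order lines into an intermediate table.
CREATE OR REPLACE TABLE stg_order_totals AS
SELECT customer_id,
       SUM(amount) AS total_amount
FROM   orders
GROUP BY customer_id;

-- Step 2: join the pre-aggregated result to produce the final table.
-- Each step is small enough to test and tune independently.
CREATE OR REPLACE TABLE fct_customer_revenue AS
SELECT c.customer_id,
       c.customer_name,
       t.total_amount
FROM   customers c
JOIN   stg_order_totals t
       ON t.customer_id = c.customer_id;
```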
2. Load data using COPY or SNOWPIPE:
Approximately 80% of data loaded into a data warehouse is ingested via regular batch processes or, increasingly, as soon as the data files arrive. Using COPY and SNOWPIPE is the fastest and cheapest way to load data, so resist the temptation to periodically load data using other methods (such as querying external tables). In fact, this is another example of using the right tool for the job.
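As a minimal sketch (the stage and table names `@sales_stage` and `raw_sales` are hypothetical), a batch load and a continuous Snowpipe load look like this:

```sql
-- Bulk-load files already sitting in a stage:
COPY INTO raw_sales
FROM @sales_stage/2024/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Or create a pipe so Snowpipe loads files as soon as they arrive:
CREATE OR REPLACE PIPE sales_pipe AUTO_INGEST = TRUE AS
COPY INTO raw_sales
FROM @sales_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```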
3. Use multiple data models:
Local data storage used to be so costly that it was not feasible to store multiple copies of the data, each using a different data model to meet your requirements. With Snowflake, however, it makes sense to store the raw data history in a structured or VARIANT format, clean and conform the data using third normal form or a Data Vault model, and store the final consumable data in a Kimball dimensional data model. Each data model has its own advantages, and storing intermediate results has significant architectural advantages, especially the ability to reload and reprocess the data in the event of an error.
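The three zones might be laid out as follows; all database, table, and column names here are hypothetical placeholders:

```sql
-- Raw history zone: keep the source payload as-is.
CREATE TABLE IF NOT EXISTS raw_db.sales_raw (
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    payload   VARIANT
);

-- Integration zone: cleaned, conformed data (3NF or Data Vault).
CREATE TABLE IF NOT EXISTS integration_db.sale (
    sale_id     NUMBER,
    customer_id NUMBER,
    sale_date   DATE,
    amount      NUMBER(12,2)
);

-- Presentation zone: Kimball-style dimensional model for consumers.
CREATE TABLE IF NOT EXISTS mart_db.fct_sales (
    date_key     NUMBER,
    customer_key NUMBER,
    amount       NUMBER(12,2)
);
```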
4. Choose a required Virtual Warehouse size:
Another tip from the Top 3 Snowflake Performance Tuning Tactics: don’t assume a 6X-Large virtual warehouse will load massive data files any faster than an X-Small. Each physical file is loaded sequentially, so it pays to follow the Snowflake File Sizing Recommendations and either split multi-gigabyte files into chunks of 100–250 MB or load multiple data files concurrently in parallel.
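A minimal demonstration of the splitting step, using a small stand-in file (for a real multi-gigabyte export you would use something like `-C 200m`, per the sizing recommendations above):

```shell
# Generate a small stand-in for a large CSV export.
seq 1 1000 > big_export.csv

# Split into fixed-size chunks; -C keeps whole lines together so no
# CSV record is broken across two files (GNU coreutils options).
split -C 1k -d --additional-suffix=.csv big_export.csv chunk_

# Each chunk can now be PUT to a stage and loaded in parallel.
ls chunk_*.csv
```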
5. Keep raw data history:
Unless the data comes from a raw data lake, it makes sense to keep the raw data history. Ideally, store it in a VARIANT data type to benefit from automatic schema evolution. This means data can be truncated and reprocessed if errors are found in the transformation pipeline, and it provides data scientists with a great source of raw data. Even if you don’t need machine learning yet, you will almost certainly need it in the next few years, if not now.
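A sketch of a raw-history landing table, with hypothetical names (`raw_events`, `src`) and example JSON paths:

```sql
-- Land raw JSON history in a VARIANT column; new attributes in the feed
-- simply appear in the payload without any schema change.
CREATE TABLE IF NOT EXISTS raw_events (
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    src       VARIANT
);

-- Downstream views extract only the attributes they need:
SELECT src:event_type::STRING        AS event_type,
       src:payload.user_id::NUMBER   AS user_id
FROM   raw_events;
```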
6. Do not use JDBC or ODBC for normal large data loads:
Another recommendation concerning suitable tools. JDBC and ODBC interfaces may be fine for loading a few megabytes of data, but they cannot match the massive throughput of COPY and SNOWPIPE. Use them where appropriate, but not for regular large data loads.
7. Avoid scanning files:
When ingesting data using the COPY command, use partitioned staged data files, as described in step 1 of the Top 3 Snowflake Performance Tuning Tactics. This reduces the effort of scanning large numbers of data files in cloud storage.
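For example, if the stage paths are partitioned by date (the stage name `@sales_stage` and table `raw_sales` are hypothetical), restricting COPY to one partition avoids scanning every file under the stage:

```sql
-- Files are laid out as @sales_stage/<year>/<month>/<day>/...
-- Only the files under this one partition are listed and loaded:
COPY INTO raw_sales
FROM @sales_stage/2024/01/15/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```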
8. Use the Tool according to requirement:
As the quote above suggests, if you only know one tool, it will inevitably be used inappropriately. The decision is based on several factors, such as the skills available on your team, whether you need fast, near-real-time delivery, and whether you are performing a one-time data load or a process that repeats on a regular basis. Note that Snowflake can natively handle a variety of file formats including Avro, Parquet, ORC, JSON, and CSV. See the online documentation for detailed instructions on loading data into Snowflake.
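As a brief sketch of the native format support (the format and stage names are hypothetical):

```sql
-- Named file formats for two of the natively supported types:
CREATE OR REPLACE FILE FORMAT ff_parquet TYPE = PARQUET;
CREATE OR REPLACE FILE FORMAT ff_json    TYPE = JSON;

-- The same COPY command works regardless of source format:
COPY INTO raw_events
FROM @events_stage
FILE_FORMAT = (FORMAT_NAME = ff_json);
```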
9. Ensure 3rd party tools push down:
ETL tools like Ab Initio, Talend and Informatica were originally designed to extract data from source systems into an ETL server, transform the data and write them to the warehouse. As Snowflake can draw upon massive on-demand compute resources and automatically scale out, it makes no sense to have data copied to an external server. Instead, use the ELT (Extract, Load and Transform) method, and ensure the tools generate and execute SQL statements on Snowflake to maximize throughput and reduce costs.
10. Using Query Tag:
When starting a multi-step transformation task, set the session query tag using ALTER SESSION SET QUERY_TAG = 'XXXXXX', and clear it afterwards with ALTER SESSION UNSET QUERY_TAG. This stamps an identifier on each SQL statement until the tag is unset, which is very helpful for system administrators. Each SQL statement (and its QUERY_TAG) is logged in the QUERY_HISTORY view, allowing you to track job performance over time. This lets you quickly see when a change to a task has degraded performance, identify inefficient transformation jobs, or spot jobs that would run better on a larger or smaller warehouse.
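Putting this together (the tag value `nightly_load_step1` is a hypothetical job identifier):

```sql
-- Tag every statement in this session with a job identifier:
ALTER SESSION SET QUERY_TAG = 'nightly_load_step1';

-- ... run the transformation steps ...

ALTER SESSION UNSET QUERY_TAG;

-- Later, track the job's performance over time:
SELECT query_text, total_elapsed_time, warehouse_size
FROM   snowflake.account_usage.query_history
WHERE  query_tag = 'nightly_load_step1'
ORDER  BY start_time DESC;
```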
11. Use Transient Tables for Intermediate Results:
In complex ELT pipelines, write intermediate results to transient tables, which may be truncated before the next load. This cuts Time Travel storage down to just one day and avoids an additional 7 days of Fail-safe storage. Use temporary tables where it makes sense; however, it is often useful to be able to validate the results of intermediate steps in a complex ELT pipeline, which transient tables allow and session-scoped temporary tables do not.
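A minimal sketch, with a hypothetical intermediate table name:

```sql
-- Transient tables have at most 1 day of Time Travel and no
-- Fail-safe period, reducing storage costs for throwaway data.
CREATE TRANSIENT TABLE IF NOT EXISTS stg_daily_sales (
    sale_date DATE,
    amount    NUMBER(12,2)
);

-- Truncate before each load; unlike a temporary table, the results
-- remain available for validation between pipeline runs.
TRUNCATE TABLE stg_daily_sales;
```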
12. Avoid row-by-row processing:
Modern analytics platforms such as Snowflake are designed to ingest, process, and analyze billions of rows at incredible speed using simple SQL statements that act on the data set as a whole. However, people tend to think in terms of row-by-row processing, and this can lead to programming loops that fetch and update one row at a time. Be aware that row-by-row processing is the single biggest way to slow query performance. Use SQL statements that process all table entries at once, and avoid row-by-row processing at all costs.
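For instance, instead of looping over customers in a stored procedure and updating them one at a time, a single set-based statement does the same work in one pass (table and column names are hypothetical):

```sql
-- Recompute every customer's total in one set-based UPDATE:
UPDATE customers c
SET    total_amount = t.total_amount
FROM   (SELECT customer_id, SUM(amount) AS total_amount
        FROM   orders
        GROUP BY customer_id) t
WHERE  c.customer_id = t.customer_id;
```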
13. Follow standard ingestion patterns:
This involves a multi-step process of storing data files in cloud storage, loading them into staging tables, and then transforming the data. Breaking the entire process down into defined steps makes it easier to orchestrate and test.
Conclusion:
In this blog, we have discussed a few ways to deal with data using Snowflake. The above are the best practices one should follow for data ingestion, transformation, batch ingestion and processing, continuous data streaming, and data lake exploration.