Data Lake — S3 Buckets Consumer Guide
This guide explains how report data is prepared when we write it to the S3 buckets you access, and how you should read and use that data so it loads correctly (e.g. into your analytics tools).
1. How We Prepare the Data
Report files in the data lake are CSV (comma-separated values). We generate them using Amazon Redshift UNLOAD with the following options:
| Setting | Value | Description |
|---|---|---|
| Delimiter | Comma (`,`) | Field separator. |
| Field enclosure | Double quote (`"`) | Every field is wrapped in double quotes. |
| Escape character | Backslash (`\`) | Special characters inside fields are escaped with `\`. |
| Header | Yes | First row contains column names. |
| Encoding | UTF-8 | File encoding. |
1.1 Escaping Rules (Important)
Because we use backslash as the escape character, any comma, newline, quote, or backslash that appears inside a field is escaped as follows:
| Character in data | How it appears in the CSV |
|---|---|
| Comma (`,`) | `\,` |
| Line feed | `\n` |
| Carriage return | `\r` |
| Double quote (`"`) | `\"` |
| Backslash (`\`) | `\\` |
So:

- Do not treat `\n` as a real newline when parsing — it is an escaped sequence representing a newline that was in the original value.
- Do not split on every comma — commas inside quoted fields are written as `\,`.
- A double quote inside a value is written as `\"`, not as `""` (we do not use RFC 4180 double-quote escaping).
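If you ever need to undo this escaping by hand (for example, after extracting a field with custom tooling), the table above maps to a small helper like the following sketch. The `unescape` and `ESCAPES` names are hypothetical, not part of any shipped tooling, and the `n`/`r` mappings follow the table's statement that `\n` and `\r` represent a line feed and carriage return from the original value.

```python
# Mapping from the character after a backslash to the original character,
# per the escaping table above.
ESCAPES = {
    ",": ",",    # \,  -> comma
    '"': '"',    # \"  -> double quote
    "\\": "\\",  # \\  -> backslash
    "n": "\n",   # \n  -> line feed
    "r": "\r",   # \r  -> carriage return
}

def unescape(field: str) -> str:
    """Undo backslash escaping on a single field (quotes already removed)."""
    out = []
    i = 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            out.append(ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)
```

In practice a properly configured CSV parser does this for you; the helper only makes the rules explicit.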
1.2 Example Rows
Header:

`"column_a","column_b","column_c"`

Data row (no special characters):

`"value1","value2","value3"`

Data row (comma and quote inside a field):

`"value1","Hello, \"World\"","value3"`

Here, the second column’s value is: `Hello, "World"`.

Data row (newline in data, shown escaped):

`"value1","Line1\nLine2","value3"`

The second column’s value is: `Line1` + newline + `Line2`.
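The "comma and quote" example row above can be checked with an escape-aware reader. A minimal sketch using Python's standard `csv` module; the `doublequote=False` setting matters here, since quotes are escaped as `\"` rather than doubled:

```python
import csv
import io

# The "comma and quote inside a field" example row, as raw file text.
raw = '"value1","Hello, \\"World\\"","value3"\n'

reader = csv.reader(
    io.StringIO(raw),
    delimiter=",",
    quotechar='"',
    escapechar="\\",
    doublequote=False,  # quotes arrive as \" , not doubled as ""
)
row = next(reader)
print(row)  # ['value1', 'Hello, "World"', 'value3']
```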
1.3 Null and Empty Values
- NULL in the source is written as an empty quoted field: `""`.
- An empty string is also written as `""`.

Your loader may treat both as NULL or as an empty string, depending on your rules; the file itself does not distinguish the two cases.
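A minimal sketch of one possible rule, assuming you choose to map `""` to NULL after parsing; the `normalize` name is hypothetical:

```python
def normalize(value):
    # One possible rule: treat empty quoted fields as NULL.
    # Adjust to your own schema or bronze-layer conventions.
    return None if value == "" else value

row = ["value1", "", "value3"]
print([normalize(v) for v in row])  # ['value1', None, 'value3']
```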
2. How You Should Use the Data
To avoid bad records, misaligned columns, and “values spilling to the next line,” your reader must use the same format options we use when writing.
2.1 Required Parser Settings
When reading these CSV files, configure your parser as follows:
| Option | Use this value |
|---|---|
| Delimiter | `,` (comma) |
| Quote character | `"` (double quote) |
| Escape character | `\` (backslash) |
| Header | First row = column names |
| Encoding | UTF-8 |
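The table above maps directly onto Python's standard `csv` module parameters. A sketch of the settings as reusable keyword arguments; `doublequote=False` follows from section 1.1, since quotes arrive escaped as `\"`, never doubled as `""`:

```python
import csv
import io

# The required parser settings, as stdlib csv keyword arguments.
READER_KWARGS = dict(
    delimiter=",",
    quotechar='"',
    escapechar="\\",
    doublequote=False,
)

# Quick check against a header line in the documented format.
header = next(csv.reader(io.StringIO('"column_a","column_b"'), **READER_KWARGS))
print(header)  # ['column_a', 'column_b']
```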
2.2 Steps to Load the Data Correctly
- **Use a CSV parser that supports an escape character.**
  Ensure the parser can use backslash (`\`) as the escape character, not only doubled double-quotes (`""`).
- **Set delimiter, quote, and escape explicitly.**
  - Delimiter: comma.
  - Quote: double quote.
  - Escape: backslash.

  Do not rely on "auto" or RFC 4180–only behavior if that would ignore the escape character.
- **Treat the first row as the header.**
  Use it for column names; do not treat it as data.
- **Use UTF-8 when reading.**
  Open or decode the file as UTF-8 to avoid issues with special characters.
- **Handle empty quoted fields.**
  Decide how `""` should be interpreted (e.g. NULL or empty string) in your schema or bronze layer.
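Put together, the steps above can be sketched as a small loader using Python's standard `csv` module. The `load_report` function name and the sample file name are hypothetical, and the `""` handling (mapping to `None`) is one possible choice for step 5:

```python
import csv

def load_report(path):
    """Sketch: load one report part file following the steps above."""
    with open(path, newline="", encoding="utf-8") as f:  # step 4: UTF-8
        reader = csv.DictReader(      # step 3: first row becomes column names
            f,
            delimiter=",",            # step 2: delimiter
            quotechar='"',            # step 2: quote
            escapechar="\\",          # steps 1-2: escape-aware parsing
            doublequote=False,
        )
        # Step 5: decide how "" is interpreted; here it becomes None.
        return [{k: (v if v != "" else None) for k, v in row.items()}
                for row in reader]

# Tiny self-check with a sample file in the documented format.
with open("sample_000.csv", "w", newline="", encoding="utf-8") as f:
    f.write('"column_a","column_b"\n"value1",""\n')

rows = load_report("sample_000.csv")
print(rows)  # [{'column_a': 'value1', 'column_b': None}]
```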
2.3 Steps to Open the Files in Excel
Best practice: import the file into Excel (do NOT double-click it).

1. Open Excel.
2. Go to Data → Get Data → From Text/CSV.
3. Select the part file (e.g. `000`).
4. In the import dialog, configure:
   - File Origin: UTF-8 (csv).
   - Delimiter: Comma.
   - Text qualifier: `"` (double quote).
   - Escape character: `\` (backslash).
5. Load the data.

This forces Excel to respect proper CSV quoting rules.
3. What to Avoid
- Do not assume RFC 4180 only (i.e. that the only escape inside quoted fields is `""`). Our files use `\"`, so RFC 4180–only parsers will misparse and can produce bad records or "double quotes inside values" errors.
- Do not assume one physical line = one record. Newlines inside fields are escaped as `\n`; only an escape-aware parser can reconstruct the full field.
- Do not split on every comma. Commas inside fields are escaped as `\,`.
- Do not ignore the escape character. Without `escape = \`, parsers will treat `\"`, `\n`, and `\,` incorrectly.
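A quick illustration of the "do not split on every comma" failure mode, using the section 1.2 example row: naive splitting yields a spurious extra column, while an escape-aware reader (sketched here with Python's `csv` module) recovers the three real fields:

```python
import csv
import io

# The "comma and quote inside a field" example row from section 1.2.
raw = '"value1","Hello, \\"World\\"","value3"'

# Naive splitting breaks inside the quoted second field:
naive = raw.split(",")
print(len(naive))  # 4 -- one column too many

# An escape-aware parser recovers the three real fields:
correct = next(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"',
                          escapechar="\\", doublequote=False))
print(len(correct))  # 3
```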
4. File Layout (Reference)
Report files are written under paths such as:
- Shared / central reports: `s3://s3-data-layer/prod/reports/new/<report_name>_report/` (e.g. `booking_report`, `flights_report`, `services_report`, etc.)
- Per-operator data (BDS / data stream): `s3://fl3xx-data-lake/operator_<operatorid>/<report_name>/`
- Per-operator data lake buckets: `s3://fl3xx-data-lake-prod-operator-<id>/<report_name>/` (e.g. `Airports`, `booking_report`, `flights_report`, `services_report`)

File names and prefixes may include part numbers (e.g. `000`, `001`) depending on the UNLOAD configuration. Use the same CSV options for all of these files.
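For example, part files belonging to one report can be grouped after listing object keys with your S3 client of choice. The keys below are illustrative only, and the trailing three-digit pattern is an assumption based on the `000`/`001` part numbering mentioned above:

```python
import re

# Illustrative object keys; real listings come from your S3 client.
keys = [
    "prod/reports/new/booking_report/booking_report_000",
    "prod/reports/new/booking_report/booking_report_001",
    "prod/reports/new/flights_report/flights_report_000",
]

# Group part files (..._000, ..._001, ...) by report prefix.
parts = {}
for key in keys:
    m = re.match(r"(.+)_(\d{3})$", key)
    if m:
        parts.setdefault(m.group(1), []).append(m.group(2))

print(parts)
# {'prod/reports/new/booking_report/booking_report': ['000', '001'],
#  'prod/reports/new/flights_report/flights_report': ['000']}
```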
5. Summary
| We use | You should use |
|---|---|
| Delimiter: comma (`,`) | Same |
| Fields enclosed in double quotes (`"`) | Same |
| Escape character: backslash (`\`) | Same — required for correct parsing |
| First row = header | Same |
| Encoding: UTF-8 | Same |
Configuring your reader with delimiter comma, quote double quote, and escape backslash will align with how we prepare the data and should prevent bad records, misaligned columns, and incorrect handling of newlines and commas inside fields.