Martin Rozariyo

L3m0nCTF2025-Writeups

Challenge Overview: The Cicada Archives


Category: Forensics
Event: L3m0nCTF 2025
Role: Challenge Author

🛠️ Author Note
This challenge was authored by me for L3m0nCTF 2025.
The following explanation describes the intended multi-layer forensic analysis path.


Intended Analysis Path

The challenge was designed to test:

  • recognition of document formats as structured containers
  • detection of invisible data embedded in normal-looking files
  • correlation of unrelated forensic artifacts across multiple layers
  • reduction of large noisy datasets to isolate anomalies
  • reconstruction of a fragmented narrative from subtle clues

Direct or surface-level inspection of any single file was intentionally insufficient.

Analysis Phase 1 — Establishing the Scope

We are given a single archive, TheCicadaArchives.tar.gz, containing three files:

  • whiteletter.docx
  • archive_2021.bin
  • evidence.zip

At first glance, everything looks ordinary: no obvious corruption, no visible clues, no readable flags. This challenge is about looking past what's visible and understanding that data can be hidden inside structure, noise, and normality.

Analysis Phase 2 — Document Container Inspection

Opening the document normally reveals nothing interesting — just plain text. This immediately suggests the content is not meant to be read directly.

A .docx file is actually a ZIP archive, so we extract it:

unzip whiteletter.docx -d whiteletter

Inside, we inspect the Word XML files. The footer is a common hiding place:

whiteletter/word/footer.xml

[image: contents of footer.xml]

Reading it visually still shows nothing suspicious.

Analysis Phase 3 — Hidden Unicode Signal Extraction

Since nothing visible stands out, the next step is to look for invisible or non-ASCII characters.

We search for zero-width Unicode characters:

grep -P "[\x{200C}\x{200D}]" footer.xml

[image: grep output]

Nothing visible is printed, but output exists — meaning invisible characters are present.

We extract only those characters:

grep -oP "[\x{200C}\x{200D}]" footer.xml > zw.txt

A quick hex dump (xxd zw.txt) confirms they are real:

[image: hex dump of zw.txt]

We see repeating patterns of:

  • e2 80 8c → U+200C
  • e2 80 8d → U+200D
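
These three-byte sequences are simply the UTF-8 encodings of the two zero-width code points, which a quick Python check confirms:

```python
# U+200C (zero-width non-joiner) and U+200D (zero-width joiner)
# encode to the exact 3-byte sequences seen in the hex dump.
zwnj = '\u200c'.encode('utf-8')
zwj = '\u200d'.encode('utf-8')
print(zwnj.hex(' '))  # e2 80 8c
print(zwj.hex(' '))   # e2 80 8d
```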

Analysis Phase 4 — Zero-Width Character Decoding

These two characters can naturally represent binary:

  • U+200C → 0
  • U+200D → 1

We decode them as binary bytes:

python3 - << 'EOF'
data = open("zw.txt", "r", encoding="utf-8").read()

bits = ""
for c in data:
    if c == '\u200c':
        bits += "0"
    elif c == '\u200d':
        bits += "1"

out = ""
for i in range(0, len(bits), 8):
    byte = bits[i:i+8]
    if len(byte) == 8:
        out += chr(int(byte, 2))

print(out)
EOF

Output:

morseindocx

This is clearly a password.

Brute-force attempts using common wordlists were intentionally ineffective.

Analysis Phase 5 — Password-Protected Artifact Recovery

Using the recovered password:

unzip evidence.zip

Password:

morseindocx

The archive extracts several files, including a packet capture, an image (img002.jpg), and a large server log (massive_server.log).

Analysis Phase 6 — Network Traffic Correlation

Opening the capture in Wireshark shows heavy, realistic traffic:

  • DNS
  • TCP
  • ICMP
  • Multiple IPs and hosts

Nothing obvious stands out initially.

DNS Analysis

We filter DNS packets using the Wireshark display filter dns:

Among many legitimate domains, one entry stands out subtly:

frg3.tdn01s3s1gn4l.net

This does not look random — it looks constructed.

Extracting the fragment:

tdn01s3s1gn4l

This becomes the third fragment (FRAG3).

Analysis Phase 7 — Secondary Signal Discovery

Still in the PCAP, we inspect HTTP traffic.

We follow the HTTP streams (Right-click → Follow → HTTP Stream).

One HTTP request contains a custom header:

[image: HTTP request with custom header]

This gives us the password:

inspectnext

Analysis Phase 8 — Image-Based Data Extraction

Using the recovered password:

steghide extract -sf img002.jpg -p inspectnext

[image: steghide output]

This extracts:

fragment2.txt

Contents:

tt3r_1nsp3c

Analysis Phase 9 — Large-Scale Log Reduction

1. Initial Analysis

We are provided with a file named massive_server.log. A quick check using ls -lh reveals the file is quite large (approx. 150 MB), containing over 1 million lines.

Attempting to read the file manually using cat or less is futile because of the sheer volume of data. The logs simulate a busy server environment with various formats:

  • Apache/Nginx Access Logs
  • Syslog messages (kernel, sshd, cron)
  • JSON structured logs
  • Java Stack Traces
  • Hex Dumps

Since we don't know what string to search for (like "flag" or "L3M0N"), a simple grep won't work.

2. The Trap

A common first attempt is to look for unique lines using sort | uniq -u. However, running this command returns almost the entire file.

Why? Every log line contains dynamic variables:

  • Timestamps: [14:22:01] vs [14:22:02]
  • IP Addresses: 192.168.1.5 vs 10.0.0.2
  • Request IDs/UUIDs: trace_id: "b394a2f7..."

To a computer, these lines are all "unique," even if they are generated by the same logging event.

3. The Solution: Log Reduction

To find the needle, we don't look for the needle; we look for the haystack. We need to perform Frequency Analysis.

By identifying the "templates" that generate the noise, we can mathematically filter them out. If 99.9% of the file follows 5 standard patterns, the flag will be the one line that follows a pattern appearing only once.

We wrote a Python script to normalize the logs, replacing all variables (numbers, IPs, UUIDs, dates) with generic placeholders like {VAR}.

The Solver Script (solve.py)

import re
from collections import Counter

# --- CONFIGURATION ---
LOG_FILE = "massive_server.log"  # Make sure this matches your file name

def normalize_log(line):
    """
    Aggressively strips variable data to reveal the 'skeleton' of the log.
    """
    line = line.strip()

    # 1. DETECT HEX DUMPS (The lines with | at the end)
    # If it starts with hex address and ends with ascii representation
    if re.search(r'^[0-9a-fA-F]{4,8}\s+[0-9a-fA-F]{2}', line) and "|" in line:
        return "HEX_DUMP_LINE"

    # 2. STRIP UUIDs (The long trace_id strings)
    # Pattern: 8-4-4-4-12 hex characters
    line = re.sub(r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}', '{UUID}', line)

    # 3. STRIP TIMESTAMPS & IPs
    line = re.sub(r'\d{4}-\d{2}-\d{2}', '{DATE}', line)
    line = re.sub(r'\d{2}:\d{2}:\d{2}', '{TIME}', line)
    line = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '{IP}', line)

    # 4. AGGRESSIVE: Strip any word containing a number (e.g., "User123", "0x4f", "thread-5")
    # This turns your flag hash "28h3Jkh..." into "{VAR}"
    line = re.sub(r'\b\w*\d\w*\b', '{VAR}', line)

    # 5. Clean up multiple spaces
    return " ".join(line.split())

def solve():
    print(f"Scanning {LOG_FILE} with aggressive filters...")

    skeleton_counts = Counter()
    skeleton_examples = {}

    try:
        with open(LOG_FILE, "r", encoding="utf-8", errors="ignore") as f:
            for line in f:
                clean_line = line.strip()
                if not clean_line: continue

                # Get the skeleton
                skeleton = normalize_log(clean_line)

                # Count it
                skeleton_counts[skeleton] += 1

                # Save the first example we see of this type
                if skeleton not in skeleton_examples:
                    skeleton_examples[skeleton] = clean_line

    except FileNotFoundError:
        print(f"Error: Could not find '{LOG_FILE}'. Check the filename.")
        return

    print("\n--- RESULTS: The Rarest Log Entries ---")

    # Print the 5 rarest skeletons (rarest first, since most_common()
    # sorts descending and [:-6:-1] walks it backwards).
    # The flag should be among the entries with Count: 1.
    found_any = False
    for skeleton, count in skeleton_counts.most_common()[:-6:-1]:
        found_any = True
        print(f"[Count: {count}]")
        print(f"Skeleton: {skeleton}")
        print(f"ORIGINAL: {skeleton_examples[skeleton]}")
        print("-" * 50)

    if not found_any:
        print("No results found. Is the file empty?")

if __name__ == "__main__":
    solve()

4. Execution & Result

Running the script produced the following output:

[image: solver output showing the rarest log lines]

5. The Flag

The anomaly contained the hidden message:

can you see this 28h3JkhN8IVHxjDI4R8F5R

The encoded fragment can be identified as Base62 and decoded accordingly.

FRAG4: s_l0gg3d}
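
Base62 has no single canonical alphabet, so decoders differ in ordering. Below is a minimal sketch of the scheme, assuming the common 0-9, A-Z, a-z ordering (the alphabet actually used by the challenge's encoder may differ, which is why off-the-shelf decoders sometimes need the alphabet swapped):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def b62encode(data: bytes) -> str:
    # Treat the bytes as one big integer and repeatedly divide by 62.
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = ALPHABET[r] + out
    return out or ALPHABET[0]

def b62decode(s: str) -> bytes:
    # Inverse: rebuild the integer, then convert back to bytes.
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")

# Round-trip sanity check on the recovered fragment.
assert b62decode(b62encode(b"s_l0gg3d}")) == b"s_l0gg3d}"
```

Note this sketch drops leading zero bytes (they vanish in the integer conversion), which is fine for printable fragments like this one.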

Analysis Phase 10 — Fragment Reassembly

The first part of the flag can be recovered during the initial analysis simply by running strings on the artifacts:

[image: strings output]

L3m0nCTF{wh1t3l3

Reassembling all four fragments in order yields the full flag.

Flag: L3m0nCTF{wh1t3l3tt3r_1nsp3ctdn01s3s1gn4ls_l0gg3d}
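
Concatenating the four recovered fragments in order reproduces the flag:

```python
fragments = [
    "L3m0nCTF{wh1t3l3",  # FRAG1: strings on the initial artifacts
    "tt3r_1nsp3c",       # FRAG2: steghide payload from img002.jpg
    "tdn01s3s1gn4l",     # FRAG3: constructed DNS subdomain
    "s_l0gg3d}",         # FRAG4: Base62-decoded log anomaly
]
flag = "".join(fragments)
print(flag)  # L3m0nCTF{wh1t3l3tt3r_1nsp3ctdn01s3s1gn4ls_l0gg3d}
```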