Observing Traffic Deviations in Production Logs
During a recent forensic analysis of a compromised TeamCity instance, I noticed that traditional signature-based WAF rules failed to trigger on the initial exploit phase of CVE-2024-27198. The attacker utilized URI path normalization anomalies—specifically /../ sequences—that appeared benign to standard filters but represented a significant statistical deviation from the application's normal traffic baseline. This highlighted a critical gap: our SIEM was looking for known bad strings rather than calculating the anomaly rank of the HTTP requests themselves.
HTTP Anomaly Rank (HAR) shifts the detection paradigm from binary pattern matching to a weighted scoring system. By assigning a numerical value to every attribute of an incoming request, we can surface "low-and-slow" attacks that bypass traditional thresholds. We started by extracting raw packet data to build our initial baseline, focusing on the relationship between request methods, header entropy, and payload size. For DevOps teams managing these environments, implementing secure SSH access for teams allows for centralized auditing of all CLI-based forensic activities.
Extracting HTTP metadata for baseline analysis using tshark
tshark -i eth0 -Y 'http.request' -T fields -e frame.time_epoch -e ip.src -e http.host -e http.request.method -e http.user_agent -e http.request.uri.query -E separator=, -E quote=d > http_traffic.csv
What is HTTP Anomaly Ranking?
An HTTP Anomaly Rank is a cumulative score representing how much a specific web request deviates from established norms. In our implementation, we treat every request as a vector of features. If a request uses an OPTIONS method on an endpoint that only ever sees GET or POST, its anomaly score increases. If that same request also lacks a Referer header and originates from an ASN associated with bulletproof hosting, the score crosses an actionable threshold in the SIEM.
We define the rank using a weighted sum of individual anomalies. For example, a missing User-Agent might add +5 to the score, while a high-entropy query string—suggestive of an encrypted C2 heartbeat or a SQL injection payload—might add +25. This allows us to prioritize alerts based on the severity of the deviation rather than treating all WAF logs as equal priority.
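The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, not our production scorer: only the +5 and +25 weights come from the text, and the flag names and remaining weights are hypothetical placeholders.

```python
# Hypothetical weight table: only the +5 and +25 values come from the text;
# the other entries are placeholders to be tuned against your own baseline.
WEIGHTS = {
    "missing_user_agent": 5,
    "high_entropy_query": 25,
    "unusual_method": 10,
    "missing_referer": 5,
}


def anomaly_rank(flags, weights=WEIGHTS):
    """Return the weighted sum of every anomaly flag that is set."""
    return sum(weight for name, weight in weights.items() if flags.get(name))


request_flags = {"missing_user_agent": True, "high_entropy_query": True}
print(anomaly_rank(request_flags))  # 5 + 25 = 30
```

Keeping the weights in a plain table makes each score explainable: an analyst can see exactly which flags contributed to a given rank.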
The Importance of Anomaly Detection in Modern Cybersecurity
In the context of the Indian digital landscape, particularly with the rollout of the Digital Personal Data Protection (DPDP) Act 2023, the cost of a data breach has moved beyond operational downtime to significant regulatory penalties (up to ₹250 crore). Traditional signature-based detection is insufficient for zero-day exploits where the signature is not yet known to vendors. Anomaly ranking provides a safety net by flagging the "weird" traffic that precedes a data exfiltration event.
We observed that attackers targeting Indian financial infrastructure often use "Header Smuggling." By injecting non-standard headers that are ignored by the frontend proxy but interpreted by the backend server, they can bypass authentication. Anomaly ranking identifies these non-standard headers immediately, even if the payload itself doesn't match a known exploit signature.
How SIEM Platforms Leverage HTTP Traffic Data
Modern SIEM platforms like Splunk, ELK, or Microsoft Sentinel act as the central nervous system for these scores. We don't just send the raw logs; we enrich them at the ingestion layer. By calculating the anomaly score at the edge (using a WAF or a custom Lua script in Nginx) and passing that score as a metadata field, the SIEM can perform real-time correlation across multiple requests from the same source IP.
I found that the most effective way to leverage SIEM is through "Time-Series Deviation." Instead of looking at a single request, we look at the average anomaly score of a source IP over a 5-minute sliding window. If the moving average spikes by 300%, the SIEM triggers an automated block via an API call to the perimeter firewall.
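A rough sketch of that sliding-window logic in Python follows. It assumes events arrive in timestamp order per IP and that "spikes by 300%" means the window average is at least 300% higher than a previously recorded baseline average. The class and function names are illustrative, and the firewall API call is left out.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # 5-minute sliding window, as in the text


class WindowedAverage:
    """Tracks the mean anomaly score per source IP over a sliding window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = defaultdict(deque)  # ip -> deque of (timestamp, score)

    def add(self, ip, ts, score):
        """Record one scored request and return the current window average."""
        queue = self.events[ip]
        queue.append((ts, score))
        # Evict events that have aged out of the window.
        while queue and queue[0][0] <= ts - self.window:
            queue.popleft()
        return sum(s for _, s in queue) / len(queue)


def is_spike(current_avg, baseline_avg, increase_pct=300.0):
    """True when current_avg is at least increase_pct percent above baseline."""
    if baseline_avg <= 0:
        # Assumption: any non-zero score against a zero baseline is a spike.
        return current_avg > 0
    return (current_avg - baseline_avg) / baseline_avg * 100.0 >= increase_pct
```

In a SIEM the same idea is usually expressed with the platform's own windowing functions; the Python version is only meant to make the arithmetic explicit.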
Establishing a Behavioral Baseline for Web Traffic
Before you can rank an anomaly, you must define "normal." We spent three weeks in "Learning Mode," capturing traffic patterns for our primary API endpoints. We discovered that 98% of our legitimate traffic followed a very specific header ordering and used a limited set of User-Agents. Any deviation from this ordering is a strong indicator of automated tooling or manual exploitation attempts.
To establish this baseline, we used a combination of Python scripts and SIEM aggregation. We calculated the frequency of every header key-value pair. We found that legitimate browser traffic from Indian ISPs (like Jio or Airtel) often includes specific headers related to mobile data optimization. If a request claims to be from a Chrome browser but lacks these expected ISP-injected headers, its anomaly rank is adjusted upward.
import pandas as pd
from scipy.stats import zscore

# Simple Python logic to identify payload size anomalies
df = pd.read_csv('http_traffic.csv', names=['ts', 'src', 'host', 'method', 'ua', 'query'])
df['query_len'] = df['query'].str.len().fillna(0)
df['z_score'] = zscore(df['query_len'])

# Flag requests more than 3 standard deviations above the mean
anomalies = df[df['z_score'] > 3]
print(anomalies[['src', 'query_len', 'z_score']])
Statistical Methods for Scoring Anomalous Requests
We utilize Z-score analysis for numerical features like payload size and request frequency. For categorical features like User-Agent or Accept-Language, we use Shannon Entropy. High entropy in a URI or a header value often suggests obfuscated shellcode or a base64-encoded payload. We've integrated these calculations into our ingestion pipeline using Logstash filters.
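For reference, character-level Shannon entropy of a single URI or header value can be computed as below. This is a standalone sketch rather than our Logstash filter, and any cut-off for what counts as "high" is something you would need to derive from your own baseline.

```python
import math
from collections import Counter


def shannon_entropy(value):
    """Return the Shannon entropy of a string in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A repetitive value scores low; a random-looking token scores higher.
for sample in ("aaaaaaaa", "page=2&sort=asc", "dGhpcyBpcyBhIHRlc3Q9PQ"):
    print(f"{sample!r}: {shannon_entropy(sample):.2f}")
```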
Another effective method is "Jaccard Similarity" for header sets. We compare the set of headers in an incoming request to the "Golden Set" of headers seen during the baseline period. If the Jaccard similarity index falls below 0.6, it indicates the request structure is significantly different from what our legitimate users send, warranting a higher anomaly rank.
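The header-set comparison can be sketched as follows. The contents of `GOLDEN_SET` are a hypothetical example, not our real baseline; the 0.6 threshold is the one mentioned above.

```python
# Hypothetical baseline: replace with the header names observed in your own
# learning period.
GOLDEN_SET = {"host", "user-agent", "accept", "accept-language", "accept-encoding"}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def header_set_is_anomalous(header_names, golden=GOLDEN_SET, threshold=0.6):
    """True when the request's header names diverge too far from the baseline."""
    observed = {name.lower() for name in header_names}
    return jaccard(observed, golden) < threshold


print(header_set_is_anomalous(["Host", "X-Injected"]))  # similarity 1/6
```

Header names are lower-cased before comparison because HTTP header names are case-insensitive.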
Machine Learning vs. Rule-Based Detection in SIEM
While machine learning (ML) models like Isolation Forests are excellent for discovering unknown-unknowns, they often suffer from "Black Box" syndrome where security analysts don't understand why a request was flagged. In our SOC, we use a hybrid approach. We use ML to identify new clusters of anomalous traffic, but we translate those findings into human-readable scoring rules.
Rule-based detection remains the backbone of our operationalized HAR. It is predictable, low-latency, and easy to tune. For instance, we can explicitly weight the X-Forwarded-For header. In Indian infrastructure, many Tier-2 city ISPs use transparent proxies that inject multiple IPs into this header. A rule-based system allows us to account for this regional quirk without the ML model flagging every user from a specific city as an anomaly.
Analyzing Request Headers and Payload Sizes
A common indicator of a web exploit attempt is an oversized header. Attackers often use large headers to trigger buffer overflows or to smuggle payloads past simple regex filters. We tested our detection by sending a custom header with a 5000-character string of 'A's, mimicking a classic stack-smashing attempt.
Testing WAF response to oversized custom headers
curl -v -H "X-Custom-Header: $(python3 -c 'print("A"*5000)')" http://localhost/api/v1/resource
In our SIEM, this request would immediately receive an anomaly rank of +50. We also monitor the declared Content-Length against the actual body size. A mismatch here is a hallmark of HTTP Request Smuggling (HRS). By comparing these two values, we can detect HTTP desync attacks, a technique often used to steal session cookies from other users.
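The Content-Length comparison is simple to express in code. The sketch below assumes you have the parsed headers and the raw body bytes available at the point of inspection (for example in a proxy plugin); the function name is illustrative.

```python
def content_length_mismatch(headers, body):
    """True when the declared Content-Length disagrees with the body size.

    headers: dict of header name -> value; body: bytes actually received.
    """
    declared = None
    for name, value in headers.items():
        if name.lower() == "content-length":
            declared = value.strip()
    if declared is None:
        return False  # nothing declared, nothing to compare
    if not (declared.isascii() and declared.isdigit()):
        return True  # a non-numeric length is itself anomalous
    return int(declared) != len(body)


print(content_length_mismatch({"Content-Length": "3"}, b"hello"))  # True
```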
Monitoring Unusual User-Agent Strings and Methods
We've found that many automated scanners (like Nikto or Sqlmap) still use default User-Agent strings or strings that, while mimicking browsers, miss subtle details like the correct Sec-CH-UA (Client Hints) headers. A request that identifies as "Chrome 120" but lacks the corresponding Sec-CH-UA-Platform header is highly anomalous.
Furthermore, we monitor for "Method Probing." Legitimate users rarely use PUT, DELETE, or TRACE on a public-facing blog or login page. We've configured our SIEM to alert on any non-GET/POST method that originates from a non-administrative IP range. This is particularly effective for catching early-stage reconnaissance.
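A minimal version of the method-probing check might look like this. The admin network range is a made-up placeholder, and the allowed set follows the GET/POST rule described above.

```python
import ipaddress

# Hypothetical administrative range; substitute your own networks.
ADMIN_NETWORKS = [ipaddress.ip_network("10.20.0.0/16")]
ALLOWED_METHODS = {"GET", "POST"}


def is_method_probe(method, src_ip, admin_networks=ADMIN_NETWORKS):
    """True for a non-GET/POST request from outside the admin networks."""
    if method.upper() in ALLOWED_METHODS:
        return False
    addr = ipaddress.ip_address(src_ip)
    return not any(addr in net for net in admin_networks)


print(is_method_probe("TRACE", "203.0.113.7"))  # True: outside the admin range
print(is_method_probe("DELETE", "10.20.3.4"))   # False: admin source
```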
Geospatial and Temporal Traffic Patterns
Anomaly ranking must be context-aware. A spike in traffic at 3:00 AM IST from an IP range in a country where we have no business operations is an anomaly. However, if that same traffic occurs during business hours from a major Indian tech hub like Bengaluru or Hyderabad, the weight of the geospatial anomaly is reduced.
We also look at "Velocity Anomalies." If a single source IP is accessing 50 unique URIs per second, it is likely a scraper or a fuzzer. We use the SIEM's streamstats or aggregate functions to track these metrics in real-time. This temporal analysis is crucial for detecting DDoS attacks like the HTTP/2 Rapid Reset (CVE-2023-44487), which relies on a high frequency of RST_STREAM frames.
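Outside the SIEM, the unique-URIs-per-second check can be approximated with a small Python function. This sketch uses fixed one-second buckets rather than a true sliding window, which is an assumption made for simplicity; the threshold of 50 is the figure from the text.

```python
from collections import defaultdict


def velocity_offenders(events, max_unique_per_second=50):
    """Return source IPs that hit the unique-URI threshold in any one second.

    events: iterable of (epoch_seconds, src_ip, uri) tuples.
    """
    buckets = defaultdict(set)  # (ip, whole second) -> set of URIs
    for ts, ip, uri in events:
        buckets[(ip, int(ts))].add(uri)
    return {
        ip
        for (ip, _second), uris in buckets.items()
        if len(uris) >= max_unique_per_second
    }


# A fuzzer requesting 50 distinct paths within one second gets flagged.
fuzzer = [(1000.1 + i * 0.01, "198.51.100.9", f"/p{i}") for i in range(50)]
print(velocity_offenders(fuzzer))
```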
Reducing Alert Fatigue through Prioritization
One of the biggest challenges in any SOC is the sheer volume of WAF alerts. By implementing HTTP Anomaly Ranking, we moved away from alerting on every "SQL Injection Attempt" block. Instead, we only alert when the cumulative anomaly rank of a source IP exceeds 100 over a 10-minute period. This reduced our alert volume by 65% while increasing our detection of actual successful compromises.
We categorize alerts into three tiers:
- Tier 1 (Rank 20-50): Logged for correlation, no immediate alert.
- Tier 2 (Rank 51-90): Triggers a "Suspicious Activity" dashboard update and increases logging verbosity for that IP.
- Tier 3 (Rank 91+): Immediate P1 alert and automated mitigation.
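The tier boundaries above translate directly into a lookup function. Treating ranks below 20 as tier 0 (no action) is an assumption, since the list does not cover that range.

```python
def tier_for_rank(rank):
    """Map a cumulative anomaly rank to the alert tier described above."""
    if rank >= 91:
        return 3  # immediate P1 alert and automated mitigation
    if rank >= 51:
        return 2  # dashboard update, increased logging verbosity
    if rank >= 20:
        return 1  # logged for correlation only
    return 0      # assumption: below 20, no action is taken


for rank in (10, 35, 75, 120):
    print(rank, "-> tier", tier_for_rank(rank))
```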
Identifying Low-and-Slow Attacks and Data Exfiltration
Anomaly ranking is uniquely suited for detecting data exfiltration that mimics legitimate traffic. We monitor the Average Response Size per user. If a user who typically downloads 10KB of JSON data suddenly starts receiving 5MB responses, the anomaly rank spikes. This is how we detected a compromised internal account being used to scrape customer data—the requests were technically valid, but the volume was anomalous.
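One simple way to express the response-size check is as a ratio against the user's historical mean. The sketch below is illustrative only: the `max_ratio` of 50 is a hypothetical threshold, and a production version would likely use a more robust statistic than the plain mean.

```python
from statistics import mean


def response_size_ratio(history_bytes, new_size_bytes):
    """Ratio of a new response size to the user's historical mean, or None."""
    if not history_bytes:
        return None  # no baseline yet for this user
    baseline = mean(history_bytes)
    if baseline <= 0:
        return None
    return new_size_bytes / baseline


def is_exfil_candidate(history_bytes, new_size_bytes, max_ratio=50.0):
    """True when a response is at least max_ratio times the user's usual size."""
    ratio = response_size_ratio(history_bytes, new_size_bytes)
    return ratio is not None and ratio >= max_ratio


# A user who normally receives ~10 KB suddenly gets a 5 MB response.
print(response_size_ratio([10_000, 10_000, 10_000], 5_000_000))  # 500.0
```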
In the Indian context, where many organizations use shared hosting or Government Community Cloud (GCC) environments, identifying exfiltration is difficult because the outbound bandwidth is often shared. Anomaly ranking at the HTTP layer provides the granularity needed to see through the noise of the shared environment.
Implementing HTTP Anomaly Rank: A Step-by-Step Guide
To implement this, we started at the log ingestion layer. We used ModSecurity with the OWASP Core Rule Set (CRS) as our primary data generator. Instead of just blocking, we configured it to increment a score variable. We then exported this variable in the Nginx access logs in JSON format for easy ingestion into our SIEM.
ModSecurity snippet for cumulative anomaly scoring
SecRule REQUEST_HEADERS:User-Agent "@rx ^$" \
    "id:10001,phase:1,pass,log,msg:'Empty User-Agent Anomaly',setvar:'tx.anomaly_score=+5'"
SecRule REQUEST_METHOD "!@rx ^(GET|POST|HEAD)$" \
    "id:10002,phase:1,pass,log,msg:'Non-standard Method',setvar:'tx.anomaly_score=+10'"
SecRule TX:ANOMALY_SCORE "@ge 20" \
    "id:10003,phase:2,pass,log,msg:'High HTTP Anomaly Rank Detected',tag:'SIEM_ALERT'"
Data Ingestion: Collecting Logs from WAFs and Proxies
We ensure that all ingress controllers in our Kubernetes clusters are logging the X-Request-ID and the custom anomaly score. This allows us to trace a single anomalous request from the edge through our microservices architecture. We use jq to filter these logs during local debugging sessions to identify which services are being targeted by high-rank requests.
Filtering Kubernetes ingress logs for high anomaly scores
kubectl logs -l app=ingress-nginx --tail=1000 | jq 'select(.status >= 400) | {remote_addr, request_id, anomaly_score: .config_anomaly_score}'
Configuring Thresholds and Risk Weights
Threshold setting is an iterative process. We started with conservative numbers and adjusted them based on false positive reports. We found that "Search Engine Bots" (Googlebot, Bingbot) often triggered high anomaly ranks because their behavior mimics scanners. We had to implement a "Whitelisted Anomaly" rule that checks the reverse DNS of the IP before finalizing the score.
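The reverse DNS check is usually done as a forward-confirmed lookup: resolve the IP to a hostname, confirm the hostname belongs to the crawler's domain, then resolve the hostname back and confirm it returns the same IP. The sketch below shows the idea. The suffix list is an assumption based on the domains Google and Bing publish for crawler verification and should be checked against their current documentation. `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Assumed crawler domains; verify against the vendors' current documentation.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def _reverse(ip):
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check for a claimed search-engine bot.

    The resolver functions are injectable so the logic can be tested offline.
    """
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(CRAWLER_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Only when this returns True would the score be discounted; a User-Agent string alone is trivially spoofed.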
We also assigned higher risk weights to sensitive endpoints like /api/v1/auth/login and /api/v1/payments. An anomaly on the homepage might get a weight of 1x, while the same anomaly on the payment gateway gets a weight of 5x. This ensures that our SIEM alerts are business-context aware.
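Endpoint weighting can be layered on top of the base score with a small multiplier table. In this sketch the 5x value for the payments path comes from the text; applying the same 5x to the login endpoint is an assumption, as is the prefix-matching behaviour.

```python
# Multipliers per sensitive path prefix. Only the 5x payment weight is from
# the text; the login multiplier is a hypothetical choice.
ENDPOINT_MULTIPLIERS = {
    "/api/v1/auth/login": 5.0,
    "/api/v1/payments": 5.0,
}


def endpoint_weighted_score(base_score, path, multipliers=ENDPOINT_MULTIPLIERS,
                            default=1.0):
    """Scale an anomaly score by the sensitivity of the requested endpoint."""
    for prefix, multiplier in multipliers.items():
        # Match the endpoint itself or anything nested beneath it.
        if path == prefix or path.startswith(prefix + "/"):
            return base_score * multiplier
    return base_score * default


print(endpoint_weighted_score(10, "/"))                        # 10.0
print(endpoint_weighted_score(10, "/api/v1/payments/charge"))  # 50.0
```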
Common Use Cases for HTTP Anomaly Detection
Zero-day exploits often involve "Impossible Header Combinations." For example, a request that includes both a Content-Length and a Transfer-Encoding: chunked header is frequently used in request smuggling. By ranking this combination as a high-severity anomaly, we can detect the exploit before a specific CVE signature is even released.
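A check for this kind of conflicting message framing might look like the sketch below. It takes headers as a list of pairs so that duplicates are preserved, and it also flags multiple Content-Length headers with differing values, which is a related framing conflict that we add here as an assumption. The anomaly names are illustrative.

```python
def framing_anomalies(headers):
    """Return the names of framing conflicts found in a request.

    headers: list of (name, value) tuples, duplicates preserved.
    """
    lengths = {v.strip() for k, v in headers if k.lower() == "content-length"}
    has_te = any(k.lower() == "transfer-encoding" for k, _ in headers)

    found = []
    if lengths and has_te:
        found.append("content_length_with_transfer_encoding")
    if len(lengths) > 1:
        found.append("conflicting_content_length")
    return found


suspicious = [("Host", "example.test"), ("Content-Length", "4"),
              ("Transfer-Encoding", "chunked")]
print(framing_anomalies(suspicious))
```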
In the case of SQL injection, even if the attacker uses advanced obfuscation to bypass regex, the resulting URI often has a much higher character entropy than normal. By monitoring the "Entropy Rank" of URI parameters, we caught an attacker using a polyglot payload that had bypassed three different commercial WAFs.
Mitigating Distributed Denial of Service (DDoS) Attacks
Modern DDoS attacks are no longer just about volume; they are about application-layer exhaustion. The HTTP/2 Rapid Reset (CVE-2023-44487) is a prime example. It doesn't require high bandwidth, but it does require an anomalous frequency of stream resets. By ranking the RST_STREAM frame frequency per connection, we can identify and drop these connections at the edge before they overwhelm the backend.
For Indian enterprises, DDoS attacks often coincide with major events or financial year-end. During these periods, we increase the sensitivity of our temporal anomaly ranks. If the SIEM detects a cluster of IPs with high anomaly ranks all targeting the same resource, it triggers a "Shields Up" mode, automatically increasing the CAPTCHA difficulty for those specific IP ranges.
Challenges and Best Practices
The transition to encrypted traffic (HTTPS) means that your anomaly detection must happen at a point where the traffic is decrypted—either at the Load Balancer or the WAF. If you are using a network tap (like an IDS), you must ensure it has access to the private keys for SSL inspection, which introduces its own set of security risks and compliance hurdles under the DPDP Act.
Another challenge is "Dynamic Web Environments." Modern SPAs (Single Page Applications) often send complex JSON blobs that can look like anomalies to a poorly tuned system. We've learned to exclude the body of specific trusted API calls from entropy analysis, focusing instead on the structural metadata of the request.
Minimizing False Positives in Dynamic Web Environments
The key to minimizing false positives is "Contextual Whitelisting." In India, we noticed many users from rural areas use older browsers or custom proxy software that strips certain headers. If we were too aggressive with our "Missing Header" scoring, we would block legitimate customers. We solved this by creating "Regional Baselines" that account for the typical traffic profile of different geographical areas.
We also implemented a "Feedback Loop" in our SIEM. When an analyst marks an alert as a false positive, the SIEM automatically calculates which specific anomaly contributed most to that score and suggests a weight adjustment. This continuous tuning is essential for maintaining the system's credibility with the SOC team.
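The core of such a feedback loop is small: find the anomaly that contributed most to a false-positive score and propose a reduced weight for it. The sketch below is a hypothetical illustration; the 10% reduction step is an arbitrary placeholder, not the adjustment our SIEM actually suggests.

```python
def top_contributor(contributions):
    """Return (anomaly_name, points) for the largest contributor, or None.

    contributions: dict of anomaly name -> points it added to the score.
    """
    if not contributions:
        return None
    return max(contributions.items(), key=lambda item: item[1])


def suggest_weight(current_weight, step=0.1, floor=0.0):
    """Propose a weight reduced by the given fraction, never below floor."""
    return max(floor, current_weight * (1 - step))


false_positive = {"missing_referer": 5, "high_entropy_query": 25}
name, points = top_contributor(false_positive)
print(name, "->", suggest_weight(points))
```

The suggestion is only a prompt for the analyst; the weight change itself should still be reviewed before it is applied.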
Quick check for 4xx/5xx errors, which often correlate with high anomaly ranks (assumes the default combined log format, where the status code is field 9)
awk '$9 ~ /^[45][0-9][0-9]$/ {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr
The Shift Toward Predictive Security Analytics
The future of SIEM monitoring lies in moving from reactive alerting to predictive ranking. By analyzing the trajectory of an IP's anomaly rank over several hours, we can predict a breach attempt before the first exploit payload is ever sent. If an IP starts with low-rank reconnaissance (scanning non-existent files) and slowly moves toward high-rank activity (probing API parameters), the system can proactively blacklist the IP.
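One simple way to quantify such a trajectory is the least-squares slope of an IP's rank over time. This is a sketch of that idea only, under the assumption that rank samples are available as (hours, rank) pairs; what slope justifies a proactive block is a tuning decision that is not covered here.

```python
def rank_slope(points):
    """Least-squares slope of anomaly rank over time (rank units per hour).

    points: list of (hours_since_first_seen, rank) tuples.
    """
    n = len(points)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    if variance == 0:
        return 0.0  # all samples share one timestamp
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return covariance / variance


# Rank climbing from reconnaissance toward active probing.
print(rank_slope([(0, 5), (1, 15), (2, 25)]))  # 10.0
```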
In India, as the CERT-In guidelines become more stringent regarding incident reporting, having a detailed "Anomaly Trail" for every incident is no longer optional. The ability to show exactly how a threat actor's behavior deviated from the norm provides the forensic evidence needed for both compliance and legal recourse.
Operationalizing HTTP Anomaly Rank isn't about replacing your WAF; it's about making your SIEM smarter. By treating every request as a data point in a statistical model, you gain visibility into the subtle, sophisticated attacks that signature-based systems are designed to miss. We continue to refine our weights and baselines, ensuring our defense-in-depth strategy remains resilient against an ever-evolving threat landscape.
For your next step, I recommend running the following command on your edge proxy logs to list your 10 longest request URIs. Length is only a rough proxy for entropy, but this is often where the most interesting anomalies are hiding.
awk -F\" '{print $2}' /var/log/nginx/access.log | cut -d' ' -f2 | awk '{print length, $0}' | sort -nr | head -n 10
