Inside SQream: Point in Polygon enablement

By Ohad Shalev

6.20.2023 twitter linkedin facebook

Companies can now query geo-location data in SQL. Any date back, any size. No limits

For telecom enterprises, accurate geolocation information is crucial for use cases like network planning, quality of service for customers, and personalized advertising. One important analytical operation that plays a key role in many of these use cases is the ‘point in polygon’ problem. By determining whether a given point (i.e. coordinate) is inside or outside a defined polygon (consisting of a minimum of three points), telecoms gain valuable insights into customer behavior so they can personalize their service and provide them additional value. 

 Discover how you can leverage GPU’s for complex analytics including geo-location data 

The Problem

One of SQream’s largest telecom customers in Asia (serving more than 40 million subscribers) was asking us to do this exact thing – they wanted to determine whether each user geolocation their network produces is inside one of the predetermined polygons. Combining the geolocation information with other customer data, the operator could target specific users with the appropriate and most relevant ads and promotions. 

Developing the industry-standard support for the required data types (GEOMETRY), which we didn’t have at that time, was estimated at four months of development by our R&D department. With no time to spare, two of our Solution Engineers stepped in and tried to implement the most suitable algorithm for the problem that would also bring into effect SQreamDB’s lightning-fast performance.      

The Solutions we tried

Before our talented Solution Engineers made the algorithm, they had to check which option would work best. First, they created two tables – one for the points (two columns for longitude and latitude- x,y) and the other for the polygons (one column consisting of a sequence of comma-separated points). Second, they tried two different solutions to implement the Ray Casting (or the Crossing Number or the Even-odd Rule) algorithm.

Using Javascript in the SQream Acceleration Studio: this was the easiest and quickest solution. The only problem with this Javascript solution was that it didn’t take advantage of SQream’s parallelism processing on complex queries. The algorithm was running row-by-row, was too slow, and it was limited to querying only inside SQream Acceleration Studio. 


{{
var x = x1, y = y1;
var polygon = [[x2, y2], [x3, y3], [x4, y4], [x5, y5]];
var inside = false;
for (var i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {     var xi = polygon[i][0], yi = polygon[i][1];     var xj = polygon[j][0], yj = polygon[j][1];     var intersect = ((yi > y) != (yj > y))
&& (x < (xj - xi) * (y - yi) / (yj - yi) + xi);
if (intersect) inside = !inside;
}
return inside;
}}

 

Using Python UDFs: The Solution Engineers tried to take advantage of our Python connectivity in order to import the Shapely library, which can evaluate the topological relationships between geographical objects. They used a function called “contains()”, which checks if a polygon contains a point. Like the Javascript solution, the Python solution was also executing the algorithm row-by-row and not chunk-by-chunk, leaving them with poor performance.

 — only function: importing Python and Shapley library:


CREATE OR REPLACE FUNCTION test1(p_lat DOUBLE, p_lon double, polyg TEXT)
RETURNS BOOL
AS $$
from shapely.geometry import Point, Polygon
p = Point(p_lat, p_lon)
polyg1=polyg.replace('(','')
polyg2=tuple(polyg1.split('), '))
coords = tuple(tuple(map(float, tup.replace(')','').split(', '))) for tup in polyg2)
poly = Polygon(coords)
return poly.contains(p)

–examples of usage:

— only function:


select pip(24.976567, 60.1612500, '(24.950899, 60.169158), (24.953492, 60.169158), (24.953510, 60.170104), (24.950958, 60.169990)');
--using tables:
select g.grid_id, pip(g.grid_lat, g.grid_lon, pa.polygon_points) as pip_bool
from grid g join public_area2 pa on 1=1;

There’s Nothing like SQL

They ended up going back to our old favorite SQL, as SQreamDB was built to accelerate exactly those kinds of analytical problems. The challenge here was to convert the Javascript algorithm to SQL in a manner that will maintain all the relative advantages of SQream’s performance. So, the team decided to ask ChatGPT for help. After some unfit suggestions, The LLM chatbot gave the team the idea to use UNION ALL along with multiple complex JOINs for modifying the polygons table, and COUNT(*) to determine whether a point is inside or outside of a polygon. Utilizing this advice helped our engineers form the final query, which lead to astonishing performance!!!  


WITH cte as (
SELECT
b.grid_id as grid_id,
c1.objectid,
sum(CASE
WHEN c2.y - c1.y = 0 then 0 :: BOOL
else case
when (c1.y > b.y) != (c2.y > b.y) then (
b.x < (c2.x - c1.x) * (b.y - c1.y) / case
when (c2.y - c1.y) = 0 then 1
else (c2.y - c1.y)
end + c1.x
)
else 0 :: BOOL
end
end::int) AS sum_sign
FROM
lv_poly c1
JOIN lv_poly c2 ON c1.row_id = c2.row_id - 1
and c1.objectid = c2.objectid
join (
select g.objectid as grid_id, g.x, g.y, min_xy.objectid as pa_id  from (
SELECT
MIN(x) as min_x,
min(y) as min_y,
MAX(x) as max_x,
MAX(y) as max_y,
objectid
FROM
lv_poly c1
GROUP BY
objectid
) min_xy
join public.worship_places g on g.x > min_xy.min_x and g.x < min_xy.max_x and g.y > min_xy.min_y and g.y < min_xy.max_y) b on b.pa_id = c1.objectid
group by 1,2)
SELECT
grid_id, objectid
from
(
SELECT
grid_id,
objectid,
CASE
WHEN sum_sign :: INT % 2 != 0
THEN 1
ELSE 0
END as is_pip
FROM
CTE
GROUP BY
1, 2
)
WHERE
is_pip = 1;

The Results & What’s Next?

While testing the solution the team got incredible results – the SQL solution was 20X faster than the Python UDF, and successful compared to the Javascript. Furthermore, when scaling up to X24K the number of points, the performance remained excellent and up to our customer’s requirements.     

 

As we move forward to implement the SQL solution in our customer production environment, we are in a continuous effort to support various data types, such as ARRAY, Tuple, and Polygon. Running SQL natively on those data types will better improve the performance and efficiency of the developed algorithm.

While not created specifically to create code, Large Language Models (LLMs) such as ChatGPT certainly have the capabilities to do so. Like with any question asked to ChatGPT, it is the quality of the prompt which differentiates the good results from the bad. Here are some tips about using ChatGPT to write your code: 

  1. Prompt with context, but keep it short and precise. 
  2. Use ChatGPT to explore libraries and resources you are not familiar with.
  3. Start small and ask for examples before prompting it with big coding tasks.
  4. Sharpen your critical thinking skills, and be skeptical – there are still a lot of kinks to be worked out in this program.

——

 Discover how you can leverage GPU’s for complex analytics including geo-location data